CASE STUDY - 1 :: Healthcare Provider Fraud Detection
1. This notebook contains an extensive analysis of the { BENE + IP + OP + Fraud Provider Labels } data, performed on the publicly available Kaggle dataset with the intent:
2. I have also added some new features to bring more business context into the dataset, and analysed their impact on potential fraud.
Kindly check out the deck below for a better understanding of the business-oriented insights into this problem:
Kindly check out the doc below for the technical design description of this problem:
Notebook Contents
CASE STUDY - 1 :: Healthcare Provider Fraud Detection
Adding the Admitted or Not Admitted indicator in IP and OP Dataset
Merging the IP_OP Dataset with BENE Data
Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data
Feature Engineering + Impact Analysis
Adding New Feature :: Is_Alive?
Adding New Feature :: Claim_Duration
Adding New Feature :: Admitted_Duration
Adding New Feature :: Bene_Age
Does InscClaimAmtReimbursed influence Potential Fraud?
Does IPAnnualReimbursementAmt influence Potential Fraud?
Why do we have IP Annual Re-Imb Amount as 0 for Admitted Patients?
Does OPAnnualReimbursementAmt influence Potential Fraud?
Why do we have OP Annual Re-Imb Amount as 0 for Admitted Patients?
Adding New Feature :: Total Number of false claims filed by a Provider
Adding New Feature :: Total Number of claims or cases seen by Attending Physician
Adding New Feature :: Total Number of claims or cases seen by Operating Physician
Adding New Feature :: Total Number of claims or cases seen by Other Physician
Adding Combined Feature :: Att_Opr_Oth_Phy_Tot_Claims
Adding 3 New Features :: Prv_Tot_Att_Phy, Prv_Tot_Opr_Phy and Prv_Tot_Oth_Phy
Adding Combined Feature :: Prv_Tot_Att_Opr_Oth_Phys
Adding New Feature :: Total Unique Claim Admit Codes used by a PROVIDER
Adding New Feature :: Total Unique Number of Diagnosis Group Codes used by a PROVIDER
Adding New Feature :: Total unique Date of Birth years of beneficiaries provided by a Provider
Adding New Feature :: Sum of patients age treated by a Provider
Adding New Feature :: Sum of Insc Claim Re-Imb Amount for a Provider
Adding New Feature :: Total number of RKD Patients seen by a Provider
Q1. Which are the Top-25 Providers with maximum number of fraudulent cases?
Q2. Which are the Top-25 Providers with maximum number of non-fraudulent cases?
Q3. Which are the Top-25 Attending Physicians with maximum number of fraudulent cases?
Q4. Which are the Top-25 Attending Physicians with maximum number of non-fraudulent cases?
Q5. Which are the Top-25 Operating Physicians with maximum number of fraudulent cases?
Q6. Which are the Top-25 Operating Physicians with maximum number of non-fraudulent cases?
Q7. Which are the Top-25 Other Physicians with maximum number of fraudulent cases?
Q8. Which are the Top-25 Other Physicians with maximum number of non-fraudulent cases?
Q9. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of fraudulent cases?
Q10. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of non-fraudulent cases?
Q11. Which are the Top-25 DiagnosisGroupCode with maximum number of fraudulent cases?
Q12. Which are the Top-25 DiagnosisGroupCode with maximum number of non-fraudulent cases?
Q13. Does Age_groups have any relationship with maximum number of fraudulent cases?
Q14. Does Age_groups have any relationship with maximum number of non-fraudulent cases?
Q15. Which are the Top-25 States with maximum number of fraudulent cases?
Q16. What are the Top-25 States with maximum number of non-fraudulent cases?
Q17. Which are the Top-25 Country with maximum number of fraudulent cases?
Q18. What are the Top-25 Country with maximum number of non-fraudulent cases?
Q19. Does various Human Races have any relationship with maximum number of fraudulent cases?
Q20. Does various Human Races have any relationship with maximum number of non-fraudulent cases?
## TRAIN set files
!gdown 12zSQN2FOxmuXFhz2xzPNussPisEfVP5w
!gdown 13XyBakfHiG-BNQPrYFXAHlsOcfICOTpx
!gdown 1dLxl4vkykPcm4Zj0abYR0Ohr7STQHg-1
!gdown 1rFER-7VuYb7GfCYeJrfxPidgK0lwqw3R
import os
import sys
import math
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns',80)
label_font_dict = {'family':'sans-serif','size':13.5,'color':'brown','style':'italic'}
title_font_dict = {'family':'sans-serif','size':16.5,'color':'Blue','style':'italic'}
train_bene_df = pd.read_csv("Dataset/Train/Train_Beneficiarydata-1542865627584.csv")
train_ip_df = pd.read_csv("Dataset/Train/Train_Inpatientdata-1542865627584.csv")
train_op_df = pd.read_csv("Dataset/Train/Train_Outpatientdata-1542865627584.csv")
train_tgt_lbls_df = pd.read_csv("Dataset/Train/Train-1542865627584.csv")
train_tgt_lbls_df.head()
| | Provider | PotentialFraud |
|---|---|---|
| 0 | PRV51001 | No |
| 1 | PRV51003 | Yes |
| 2 | PRV51004 | No |
| 3 | PRV51005 | Yes |
| 4 | PRV51007 | No |
print("### The unique number of providers are {}. ###".format(train_tgt_lbls_df.shape[0]))
### The unique number of providers are 5410. ###
with plt.style.context('seaborn-poster'):
    fig = train_tgt_lbls_df["PotentialFraud"].value_counts().plot(kind='bar', color=['green','orange'])
    # The "patches" attribute of the Axes holds the rectangle bars of the graph.
    # Their location (width & height) values are used to place the annotations.
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_tgt_lbls_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Provider Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number or % share of providers\n", fontdict=label_font_dict)
    plt.yticks(np.arange(0,5100,500))
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud providers\n", fontdict=title_font_dict)
    plt.plot();
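The percentages annotated on the bars can be cross-checked numerically. A minimal sketch on a toy label table follows; the five rows mirror the `head()` output above, and `toy_lbls` and `share_pct` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for train_tgt_lbls_df (rows copied from the head() output above)
toy_lbls = pd.DataFrame({
    "Provider": ["PRV51001", "PRV51003", "PRV51004", "PRV51005", "PRV51007"],
    "PotentialFraud": ["No", "Yes", "No", "Yes", "No"],
})

# normalize=True returns class shares instead of raw counts
share_pct = (toy_lbls["PotentialFraud"].value_counts(normalize=True) * 100).round(2)
print(share_pct)
```

Applied to the full label table, the same expression should reproduce the figures printed above the bars.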
OBSERVATION
Admitted or Not Admitted indicator in IP and OP Dataset
train_ip_df["Admitted?"] = 1
train_ip_df.head()
| | BeneID | ClaimID | ClaimStartDt | ClaimEndDt | Provider | InscClaimAmtReimbursed | AttendingPhysician | OperatingPhysician | OtherPhysician | AdmissionDt | ClmAdmitDiagnosisCode | DeductibleAmtPaid | DischargeDt | DiagnosisGroupCode | ClmDiagnosisCode_1 | ClmDiagnosisCode_2 | ClmDiagnosisCode_3 | ClmDiagnosisCode_4 | ClmDiagnosisCode_5 | ClmDiagnosisCode_6 | ClmDiagnosisCode_7 | ClmDiagnosisCode_8 | ClmDiagnosisCode_9 | ClmDiagnosisCode_10 | ClmProcedureCode_1 | ClmProcedureCode_2 | ClmProcedureCode_3 | ClmProcedureCode_4 | ClmProcedureCode_5 | ClmProcedureCode_6 | Admitted? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | CLM46614 | 2009-04-12 | 2009-04-18 | PRV55912 | 26000 | PHY390922 | NaN | NaN | 2009-04-12 | 7866 | 1068.0 | 2009-04-18 | 201 | 1970 | 4019 | 5853 | 7843 | 2768 | 71590 | 2724 | 19889 | 5849 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 1 | BENE11001 | CLM66048 | 2009-08-31 | 2009-09-02 | PRV55907 | 5000 | PHY318495 | PHY318495 | NaN | 2009-08-31 | 6186 | 1068.0 | 2009-09-02 | 750 | 6186 | 2948 | 56400 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7092.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 2 | BENE11001 | CLM68358 | 2009-09-17 | 2009-09-20 | PRV56046 | 5000 | PHY372395 | NaN | PHY324689 | 2009-09-17 | 29590 | 1068.0 | 2009-09-20 | 883 | 29623 | 30390 | 71690 | 34590 | V1581 | 32723 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 3 | BENE11011 | CLM38412 | 2009-02-14 | 2009-02-22 | PRV52405 | 5000 | PHY369659 | PHY392961 | PHY349768 | 2009-02-14 | 431 | 1068.0 | 2009-02-22 | 067 | 43491 | 2762 | 7843 | 32723 | V1041 | 4254 | 25062 | 40390 | 4019 | NaN | 331.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 4 | BENE11014 | CLM63689 | 2009-08-13 | 2009-08-30 | PRV56614 | 10000 | PHY379376 | PHY398258 | NaN | 2009-08-13 | 78321 | 1068.0 | 2009-08-30 | 975 | 042 | 3051 | 34400 | 5856 | 42732 | 486 | 5119 | 29620 | 20300 | NaN | 3893.0 | NaN | NaN | NaN | NaN | NaN | 1 |
train_op_df["Admitted?"] = 0
train_op_df.head()
| | BeneID | ClaimID | ClaimStartDt | ClaimEndDt | Provider | InscClaimAmtReimbursed | AttendingPhysician | OperatingPhysician | OtherPhysician | ClmDiagnosisCode_1 | ClmDiagnosisCode_2 | ClmDiagnosisCode_3 | ClmDiagnosisCode_4 | ClmDiagnosisCode_5 | ClmDiagnosisCode_6 | ClmDiagnosisCode_7 | ClmDiagnosisCode_8 | ClmDiagnosisCode_9 | ClmDiagnosisCode_10 | ClmProcedureCode_1 | ClmProcedureCode_2 | ClmProcedureCode_3 | ClmProcedureCode_4 | ClmProcedureCode_5 | ClmProcedureCode_6 | DeductibleAmtPaid | ClmAdmitDiagnosisCode | Admitted? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11002 | CLM624349 | 2009-10-11 | 2009-10-11 | PRV56011 | 30 | PHY326117 | NaN | NaN | 78943 | V5866 | V1272 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 56409 | 0 |
| 1 | BENE11003 | CLM189947 | 2009-02-12 | 2009-02-12 | PRV57610 | 80 | PHY362868 | NaN | NaN | 6115 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 79380 | 0 |
| 2 | BENE11003 | CLM438021 | 2009-06-27 | 2009-06-27 | PRV57595 | 10 | PHY328821 | NaN | NaN | 2723 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 3 | BENE11004 | CLM121801 | 2009-01-06 | 2009-01-06 | PRV56011 | 40 | PHY334319 | NaN | NaN | 71988 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 |
| 4 | BENE11004 | CLM150998 | 2009-01-22 | 2009-01-22 | PRV56011 | 200 | PHY403831 | NaN | NaN | 82382 | 30000 | 72887 | 4280 | 7197 | V4577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 71947 | 0 |
# Common columns must be 28
common_cols = [col for col in train_ip_df.columns if col in train_op_df.columns]
len(common_cols)
28
# Merging the IP and OP dataset on the basis of common columns
train_ip_op_df = pd.merge(left=train_ip_df, right=train_op_df, left_on=common_cols, right_on=common_cols, how="outer")
train_ip_op_df.shape
(558211, 31)
train_ip_op_df.head()
| | BeneID | ClaimID | ClaimStartDt | ClaimEndDt | Provider | InscClaimAmtReimbursed | AttendingPhysician | OperatingPhysician | OtherPhysician | AdmissionDt | ClmAdmitDiagnosisCode | DeductibleAmtPaid | DischargeDt | DiagnosisGroupCode | ClmDiagnosisCode_1 | ClmDiagnosisCode_2 | ClmDiagnosisCode_3 | ClmDiagnosisCode_4 | ClmDiagnosisCode_5 | ClmDiagnosisCode_6 | ClmDiagnosisCode_7 | ClmDiagnosisCode_8 | ClmDiagnosisCode_9 | ClmDiagnosisCode_10 | ClmProcedureCode_1 | ClmProcedureCode_2 | ClmProcedureCode_3 | ClmProcedureCode_4 | ClmProcedureCode_5 | ClmProcedureCode_6 | Admitted? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BENE11001 | CLM46614 | 2009-04-12 | 2009-04-18 | PRV55912 | 26000 | PHY390922 | NaN | NaN | 2009-04-12 | 7866 | 1068.0 | 2009-04-18 | 201 | 1970 | 4019 | 5853 | 7843 | 2768 | 71590 | 2724 | 19889 | 5849 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 1 | BENE11001 | CLM66048 | 2009-08-31 | 2009-09-02 | PRV55907 | 5000 | PHY318495 | PHY318495 | NaN | 2009-08-31 | 6186 | 1068.0 | 2009-09-02 | 750 | 6186 | 2948 | 56400 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7092.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 2 | BENE11001 | CLM68358 | 2009-09-17 | 2009-09-20 | PRV56046 | 5000 | PHY372395 | NaN | PHY324689 | 2009-09-17 | 29590 | 1068.0 | 2009-09-20 | 883 | 29623 | 30390 | 71690 | 34590 | V1581 | 32723 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 3 | BENE11011 | CLM38412 | 2009-02-14 | 2009-02-22 | PRV52405 | 5000 | PHY369659 | PHY392961 | PHY349768 | 2009-02-14 | 431 | 1068.0 | 2009-02-22 | 067 | 43491 | 2762 | 7843 | 32723 | V1041 | 4254 | 25062 | 40390 | 4019 | NaN | 331.0 | NaN | NaN | NaN | NaN | NaN | 1 |
| 4 | BENE11014 | CLM63689 | 2009-08-13 | 2009-08-30 | PRV56614 | 10000 | PHY379376 | PHY398258 | NaN | 2009-08-13 | 78321 | 1068.0 | 2009-08-30 | 975 | 042 | 3051 | 34400 | 5856 | 42732 | 486 | 5119 | 29620 | 20300 | NaN | 3893.0 | NaN | NaN | NaN | NaN | NaN | 1 |
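An outer merge on every common column behaves like stacking the two claim tables when no row of one table matches a row of the other on all those columns; the inpatient-only columns are then filled with NaN for outpatient rows. A small sketch of that behaviour on toy frames (the column subset and all values are illustrative assumptions), compared with `pd.concat`:

```python
import pandas as pd

# Toy inpatient claims: carries one inpatient-only column (AdmissionDt)
ip = pd.DataFrame({
    "ClaimID": ["CLM1", "CLM2"],
    "Provider": ["PRV1", "PRV2"],
    "AdmissionDt": ["2009-04-12", "2009-08-31"],
    "Admitted?": [1, 1],
})
# Toy outpatient claims: only the common columns
op = pd.DataFrame({
    "ClaimID": ["CLM3", "CLM4", "CLM5"],
    "Provider": ["PRV1", "PRV3", "PRV3"],
    "Admitted?": [0, 0, 0],
})

common_cols = [col for col in ip.columns if col in op.columns]

# Outer merge on all shared columns vs. a plain row-wise concatenation
merged = pd.merge(ip, op, on=common_cols, how="outer")
stacked = pd.concat([ip, op], ignore_index=True)

print(merged.shape, stacked.shape)
print(merged["AdmissionDt"].isna().sum())  # outpatient rows have no admission date
```

On the toy data both results hold the same rows and columns, which is why the merged shape above equals the combined row count with the 31 inpatient columns.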
# Joining the IP_OP dataset with the BENE data
train_ip_op_bene_df = pd.merge(left=train_ip_op_df, right=train_bene_df, left_on='BeneID', right_on='BeneID',how='inner')
train_ip_op_bene_df.shape
(558211, 55)
# Joining the IP_OP_BENE dataset with the Tgt Label Provider Data
train_iobp_df = pd.merge(left=train_ip_op_bene_df, right=train_tgt_lbls_df, left_on='Provider', right_on='Provider',how='inner')
train_iobp_df.shape
(558211, 56)
# Unique Providers
train_iobp_df["Provider"].nunique()
5410
# Unique Claim numbers
train_iobp_df["ClaimID"].nunique()
558211
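The unchanged row count (558211) after each join suggests that both joins are many-to-one. A hedged sketch of making that expectation explicit with the `validate` argument of `pd.merge`, using toy frames whose values are illustrative only:

```python
import pandas as pd

claims = pd.DataFrame({
    "ClaimID": ["CLM1", "CLM2", "CLM3"],
    "BeneID": ["BENE1", "BENE1", "BENE2"],
    "Provider": ["PRV1", "PRV2", "PRV1"],
})
bene = pd.DataFrame({"BeneID": ["BENE1", "BENE2"], "Gender": [1, 2]})
labels = pd.DataFrame({"Provider": ["PRV1", "PRV2"], "PotentialFraud": ["Yes", "No"]})

# validate="many_to_one" raises pandas.errors.MergeError
# if the right-hand key column contains duplicates
claims_bene = pd.merge(claims, bene, on="BeneID", how="inner", validate="many_to_one")
full = pd.merge(claims_bene, labels, on="Provider", how="inner", validate="many_to_one")

# An inner many-to-one join with no unmatched keys keeps the row count unchanged
print(len(claims), len(full))
```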
ASSUMPTION :: One provider may be involved in more than one claim. So, are all the claims filed by a potentially fraudulent provider fraudulent?
- This cannot hold true for every provider: if a provider has filed, say, 50 claims, we cannot say that all of them are fraudulent.
- There may be a pattern in which a provider files only 1 or 2 fraudulent claims out of 50.
Therefore, it is a big assumption that all the claims filed by a potentially fraudulent provider are fraudulent.
prvs_claims_df = pd.DataFrame(train_iobp_df.groupby(['Provider'])['ClaimID'].count()).reset_index()
prvs_claims_tgt_lbls_df = pd.merge(left=prvs_claims_df, right=train_tgt_lbls_df, on='Provider', how='inner')
prvs_claims_tgt_lbls_df
| | Provider | ClaimID | PotentialFraud |
|---|---|---|---|
| 0 | PRV51001 | 25 | No |
| 1 | PRV51003 | 132 | Yes |
| 2 | PRV51004 | 149 | No |
| 3 | PRV51005 | 1165 | Yes |
| 4 | PRV51007 | 72 | No |
| ... | ... | ... | ... |
| 5405 | PRV57759 | 28 | No |
| 5406 | PRV57760 | 22 | No |
| 5407 | PRV57761 | 82 | No |
| 5408 | PRV57762 | 1 | No |
| 5409 | PRV57763 | 118 | No |
5410 rows × 3 columns
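The effect of this assumption can be illustrated on a toy example: when a provider-level label is copied to every claim, providers with many claims dominate the claim-level class balance. The counts below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# One flagged provider out of four (25% at provider level)
labels = pd.DataFrame({
    "Provider": ["PRV_A", "PRV_B", "PRV_C", "PRV_D"],
    "PotentialFraud": ["Yes", "No", "No", "No"],
})
# The flagged provider files 6 of the 10 claims
claims = pd.DataFrame({
    "ClaimID": [f"CLM{i}" for i in range(10)],
    "Provider": ["PRV_A"] * 6 + ["PRV_B"] * 2 + ["PRV_C", "PRV_D"],
})

# Every claim inherits its provider's label
claim_level = claims.merge(labels, on="Provider", how="inner")

provider_share = labels["PotentialFraud"].value_counts(normalize=True)
claim_share = claim_level["PotentialFraud"].value_counts(normalize=True)
print(provider_share["Yes"], claim_share["Yes"])
```

Here the "Yes" share rises from a quarter of providers to more than half of claims, even though no individual claim has been verified as fraudulent.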
OBSERVATION
print(pd.DataFrame(train_iobp_df['PotentialFraud'].value_counts()), "\n")
with plt.style.context('seaborn-poster'):
    fig = train_iobp_df['PotentialFraud'].value_counts().plot(kind='bar', color=['green','orange'])
    # The "patches" attribute of the Axes holds the rectangle bars of the graph.
    # Their location (width & height) values are used to place the annotations.
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/train_iobp_df.shape[0],2))+"%"}', (x + width/2, y + height*1.015), ha='center', fontsize=13.5)
    # Providing the labels and title to the graph
    plt.xlabel("Fraud or Not?", fontdict=label_font_dict)
    plt.ylabel("Number (or %) of claims\n", fontdict=label_font_dict)
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.minorticks_on()
    plt.title("Distribution of Fraud & Non-fraud claims\n", fontdict=title_font_dict)
    plt.plot();
PotentialFraud
No     345415
Yes    212796
OBSERVATION
Let's create some features
New Feature :: Is_Alive?
- Is Alive? = Yes if DOD is NaN (no date of death recorded) else No
train_iobp_df['DOB'] = pd.to_datetime(train_iobp_df['DOB'], format="%Y-%m-%d")
train_iobp_df['DOD'] = pd.to_datetime(train_iobp_df['DOD'], format="%Y-%m-%d")
# A missing date of death means the beneficiary is alive
train_iobp_df['Is_Alive?'] = np.where(train_iobp_df['DOD'].isna(), 'Yes', 'No')
train_iobp_df['Is_Alive?'].value_counts()
Yes 554080 No 4131 Name: Is_Alive?, dtype: int64
New Feature :: Claim_Duration
- Claim Duration = Claim End Date - Claim Start Date
train_iobp_df['ClaimStartDt'] = pd.to_datetime(train_iobp_df['ClaimStartDt'], format="%Y-%m-%d")
train_iobp_df['ClaimEndDt'] = pd.to_datetime(train_iobp_df['ClaimEndDt'], format="%Y-%m-%d")
train_iobp_df['Claim_Duration'] = (train_iobp_df['ClaimEndDt'] - train_iobp_df['ClaimStartDt']).dt.days
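As a quick check of the duration arithmetic, the sketch below applies the same steps to two claims taken from the tables above (CLM46614 from the inpatient data and CLM624349 from the outpatient data); the frame name `toy` is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "ClaimID": ["CLM46614", "CLM624349"],
    "ClaimStartDt": ["2009-04-12", "2009-10-11"],
    "ClaimEndDt": ["2009-04-18", "2009-10-11"],
})
for col in ["ClaimStartDt", "ClaimEndDt"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d")

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
toy["Claim_Duration"] = (toy["ClaimEndDt"] - toy["ClaimStartDt"]).dt.days
print(toy["Claim_Duration"].tolist())
```

A claim that starts and ends on the same day gets a duration of 0 under this definition.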
with plt.style.context('seaborn'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', palette='dark')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Claim_Duration and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Gender', palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Claim_Duration and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Is_Alive?', palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Claim_Duration and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    plt.figure(figsize=(16,8))
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='Race', palette='cubehelix')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc='upper center',title='Race');
OBSERVATION
Claim_Duration and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='Claim_Duration', hue='RenalDiseaseIndicator', palette='copper')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Duration of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='RKD');
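The boxen plots above can be complemented with a numeric summary of the same groupings. The sketch below is one possible way to do that with `groupby`, run on a toy frame whose values are invented for illustration; the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Gender": [1, 2, 1, 1, 2, 2],
    "Claim_Duration": [6, 2, 0, 0, 3, 1],
})

# Overall distribution summary per class (what a single box per class shows)
overall = toy.groupby("PotentialFraud")["Claim_Duration"].agg(["count", "median", "mean", "max"])

# Median per class and hue level (what the hue-split plots compare)
by_gender = toy.groupby(["PotentialFraud", "Gender"])["Claim_Duration"].median().unstack("Gender")

print(overall)
print(by_gender)
```

Swapping `"Gender"` for `"Race"`, `"Is_Alive?"` or `"RenalDiseaseIndicator"` gives the counterpart of each hue-split plot.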
OBSERVATION
New Feature :: Admitted_Duration
- Admitted Duration = Discharge Date - Admission Date
train_iobp_df['AdmissionDt'] = pd.to_datetime(train_iobp_df['AdmissionDt'], format="%Y-%m-%d")
train_iobp_df['DischargeDt'] = pd.to_datetime(train_iobp_df['DischargeDt'], format="%Y-%m-%d")
train_iobp_df['Admitted_Duration'] = (train_iobp_df['DischargeDt'] - train_iobp_df['AdmissionDt']).dt.days
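Outpatient rows carry no admission or discharge date after the merge, so this feature is NaN for them. A minimal sketch of that behaviour on a toy frame (two inpatient rows with dates from the table above, plus one illustrative outpatient row):

```python
import pandas as pd

toy = pd.DataFrame({
    "Admitted?": [1, 1, 0],
    "AdmissionDt": ["2009-04-12", "2009-08-31", None],
    "DischargeDt": ["2009-04-18", "2009-09-02", None],
})
toy["AdmissionDt"] = pd.to_datetime(toy["AdmissionDt"], format="%Y-%m-%d")
toy["DischargeDt"] = pd.to_datetime(toy["DischargeDt"], format="%Y-%m-%d")

# Missing dates become NaT, so the resulting duration is NaN for the outpatient row
toy["Admitted_Duration"] = (toy["DischargeDt"] - toy["AdmissionDt"]).dt.days
print(toy[["Admitted?", "Admitted_Duration"]])

# pandas aggregations skip NaN by default, so this averages the admitted rows only
print(toy["Admitted_Duration"].mean())
```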
with plt.style.context('seaborn'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', palette='Accent_r')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admitted Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admitted Duration for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Admit_Duration and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Gender', palette='inferno')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Admitted_Duration and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Is_Alive?', palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');
OBSERVATION
Admit_Duration and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='Race', palette='plasma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");
OBSERVATION
Admitted_Duration and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Admitted_Duration', hue='RenalDiseaseIndicator',palette='magma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Admit Duration (in days)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Admit Duration of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="RKD");
OBSERVATION
New Feature :: Bene_Age
- Bene Age = DOD - DOB (if DOD is null, replace it with the MAX date in DOD)
# Filling the null values with the MAX Date of Death in the dataset
# (assigning the result back avoids relying on in-place modification of a column selection)
train_iobp_df['DOD'] = train_iobp_df['DOD'].fillna(train_iobp_df['DOD'].max())
train_iobp_df['Bene_Age'] = round(((train_iobp_df['DOD'] - train_iobp_df['DOB']).dt.days)/365,1)
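The sketch below walks through the age calculation on two toy beneficiaries with the same date of birth, one with a recorded date of death and one without. The dates and the `DOD_filled` and `reference_date` names are illustrative assumptions; the point is that a living beneficiary's age is measured up to the latest date of death in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1940-01-01", "1940-01-01"]),
    "DOD": pd.to_datetime(["2009-12-01", None]),  # second beneficiary has no DOD
})

# The latest recorded date of death acts as the reference date for missing DODs
reference_date = toy["DOD"].max()
toy["DOD_filled"] = toy["DOD"].fillna(reference_date)

# Age in years, approximated as days / 365 and rounded to one decimal
toy["Bene_Age"] = ((toy["DOD_filled"] - toy["DOB"]).dt.days / 365).round(1)
print(toy)
```

Dividing by 365 ignores leap days, so the result is an approximation that drifts slightly for older beneficiaries.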
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', palette='Pastel2')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Beneficiary Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Beneficiary Age for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Bene_Age and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Gender', palette='inferno')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Bene_Age and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Is_Alive?', palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');
OBSERVATION
Bene_Age and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='Race', palette='plasma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");
OBSERVATION
Bene_Age and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.violinplot(data=train_iobp_df, x='PotentialFraud',y='Bene_Age', hue='RenalDiseaseIndicator',palette='magma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Bene_Age (in years)\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Bene_Age of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc='lower center', title='RKD');
OBSERVATION
InscClaimAmtReimbursed influences Potentially Fraud?
InscClaimAmtReimbursed and Potentially Fraud
with plt.style.context('seaborn'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', palette='flag')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
InscClaimAmtReimbursed and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Gender', palette='flare')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
InscClaimAmtReimbursed and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Is_Alive?', palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');
OBSERVATION
InscClaimAmtReimbursed and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='Race', palette='plasma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");
OBSERVATION
InscClaimAmtReimbursed and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df, x='PotentialFraud',y='InscClaimAmtReimbursed', hue='RenalDiseaseIndicator',palette='magma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Claim Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Claim Re-Imb Amount of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');
OBSERVATION
IPAnnualReimbursementAmt influences Potentially Fraud?
IPAnnualReimbursementAmt and Potentially Fraud
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], y='PotentialFraud',x='IPAnnualReimbursementAmt',
                        palette='dark', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.yticks(rotation=90, fontsize=12)
    plt.xlabel("\nAnnual IP Re-Imb Amount", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Annual IP Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
IP Annual Re-Imb Amount as 0 for Admitted Patients?
print(pd.DataFrame(train_iobp_df[(train_iobp_df['IPAnnualReimbursementAmt'] == 0)]['Admitted?'].value_counts()))
Admitted?
0    371263
1       413
IP Annual Re-Imb Amt is 0.
print(pd.DataFrame(train_iobp_df[(train_iobp_df['IPAnnualReimbursementAmt'] == 0) & (train_iobp_df['Admitted?'] == 1)]\
['PotentialFraud'].value_counts()))
PotentialFraud
Yes    249
No     164
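The two filtered counts above can also be read off a single cross-tabulation. A minimal sketch with `pd.crosstab` on a toy frame follows; the values are invented for illustration and only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Admitted?": [1, 1, 1, 0, 0, 1],
    "IPAnnualReimbursementAmt": [0, 0, 5000, 0, 0, 26000],
    "PotentialFraud": ["Yes", "No", "Yes", "No", "No", "No"],
})

# Keep only the claims whose beneficiary has a zero annual inpatient reimbursement
zero_ip = toy[toy["IPAnnualReimbursementAmt"] == 0]

# Rows: admitted flag, columns: provider label; margins=True adds the "All" totals
table = pd.crosstab(zero_ip["Admitted?"], zero_ip["PotentialFraud"], margins=True)
print(table)
```

The `Admitted? == 1` row of such a table corresponds to the Yes/No split printed above, and the "All" column to the admitted versus not-admitted counts.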
OBSERVATION
IPAnnualReimbursementAmt and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', hue='Gender',
                      palette='flare')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
IPAnnualReimbursementAmt and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud',y='IPAnnualReimbursementAmt', hue='Is_Alive?',
                      palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');
OBSERVATION
IPAnnualReimbursementAmt and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud', y='IPAnnualReimbursementAmt', hue='Race',
                      palette='plasma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");
OBSERVATION
IPAnnualReimbursementAmt and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 1], x='PotentialFraud', y='IPAnnualReimbursementAmt',
                      hue='RenalDiseaseIndicator', palette='magma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual IP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual IP Re-Imb Amount of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');
OBSERVATION
Does OPAnnualReimbursementAmt influences Potentially Fraud?¶
OPAnnualReimbursementAmt and Potentially Fraud
with plt.style.context('seaborn-poster'):
    fig = sns.boxenplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], y='PotentialFraud', x='OPAnnualReimbursementAmt',
                        palette='dark', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.yticks(rotation=90, fontsize=12)
    plt.xlabel("\nAnnual OP Re-Imb Amount", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Annual OP Re-Imb Amount for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Why do we have OP Annual Re-Imb Amount as 0 for Admitted Patients?¶
print(pd.DataFrame(train_iobp_df[(train_iobp_df['OPAnnualReimbursementAmt'] == 0)]['Admitted?'].value_counts()))
   Admitted?
1       3909
0       1009
Potential fraud split for non-admitted (OP) patients whose OP Annual Re-Imb Amt is 0.
print(pd.DataFrame(train_iobp_df[(train_iobp_df['OPAnnualReimbursementAmt'] == 0) & (train_iobp_df['Admitted?'] == 0)]\
                   ['PotentialFraud'].value_counts()))
     PotentialFraud
No              617
Yes             392
OBSERVATION
OPAnnualReimbursementAmt and Potentially Fraud for both the Genders
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud', y='OPAnnualReimbursementAmt', hue='Gender',
                      palette='flare')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount of both the genders for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
OPAnnualReimbursementAmt and Potentially Fraud for Is_Alive?
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud', y='OPAnnualReimbursementAmt', hue='Is_Alive?',
                      palette='prism')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount patient life status for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title='Is_Alive?');
OBSERVATION
OPAnnualReimbursementAmt and Potentially Fraud for all Human Races
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud', y='OPAnnualReimbursementAmt', hue='Race',
                      palette='plasma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount of all human races for Potentially Fraud & Non-Fraud Providers\n", fontdict=title_font_dict)
    plt.legend(loc="upper center", title="Race");
OBSERVATION
OPAnnualReimbursementAmt and Potentially Fraud for RenalDiseaseIndicator
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df[train_iobp_df['Admitted?'] == 0], x='PotentialFraud', y='OPAnnualReimbursementAmt',
                      hue='RenalDiseaseIndicator', palette='magma')
    # Providing the labels and title to the graph
    plt.xlabel("Potentially Fraud?", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Annual OP Re-Imb Amount\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Annual OP Re-Imb Amount of whether having RKD influence Potentially Fraud & Non-Fraud Providers?\n", fontdict=title_font_dict)
    plt.legend(loc='upper center', title='RKD');
OBSERVATION
Adding New Feature :: Total Number of false claims filed by a Provider¶
- Logic :: COUNT(all claims submitted by a Provider) - COUNT(all non-fraud claims submitted by a Provider)
REASONING
The idea behind adding this feature is to introduce a way to see how many fraudulent and non-fraudulent claims have been submitted by a provider.
A pattern commonly observed in Medicare fraud is that many small hospitals in rural areas are intentionally used to file false claims, through bribes or in the expectation of kickbacks. For such providers the total number of claims submitted will be small, but the majority of them will be false.
But the problem in the given dataset after joining (IP, OP, BENE with PRV TGT) is that

Adding New Feature :: Total Number of claims or cases seen by Attending Physician¶
# Total unique number of Attending Physicians
print("Unique number of Attending Physicians present in the dataset are --> {}".format(train_iobp_df['AttendingPhysician'].nunique()))
Unique number of Attending Physicians present in the dataset are --> 82063
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df.groupby(['AttendingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Att_Phy_tot_claims'].describe()
count    556703.000000
mean        138.634829
std         293.669039
min           1.000000
25%           7.000000
50%          33.000000
75%         116.000000
max        2534.000000
Name: Att_Phy_tot_claims, dtype: float64
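To make the `groupby(...).transform('count')` step concrete, here is a minimal sketch on a made-up frame (`toy_df`, `claims_per_phy` and `mapped` are hypothetical names, and the values are not from the dataset). It shows how the per-physician claim count is broadcast back onto every claim row. In recent pandas versions a row whose physician is missing gets `NaN`, which would be consistent with the `count` in the `describe()` output above being lower than the number of rows in the frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_iobp_df (illustrative values only).
toy_df = pd.DataFrame({
    'ClaimID':            ['C1', 'C2', 'C3', 'C4', 'C5'],
    'AttendingPhysician': ['PHY1', 'PHY1', 'PHY2', np.nan, 'PHY1'],
})

# Per-physician claim count, broadcast back to every claim row.
toy_df['Att_Phy_tot_claims'] = (
    toy_df.groupby('AttendingPhysician')['ClaimID'].transform('count')
)

# Equivalent lookup-based formulation: count once per physician, then map.
# value_counts() ignores missing physicians, so those rows map to NaN.
claims_per_phy = toy_df['AttendingPhysician'].value_counts()
mapped = toy_df['AttendingPhysician'].map(claims_per_phy)

print(toy_df)
```

The same pattern applies unchanged to `OperatingPhysician` and `OtherPhysician`.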
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Att_Phy_tot_claims'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Att_Phy_tot_claims'], color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nAttending Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,2800,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Attending Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Att_Phy_tot_claims', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,2800,100), rotation=90, fontsize=12)
    plt.xlabel("\nAttending Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Attending Physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Att_Phy_tot_claims may be useful in separating potentially fraudulent from non-fraudulent cases.
Adding New Feature :: Total Number of claims or cases seen by Operating Physician¶
# Total unique number of Operating Physicians
print("Unique number of Operating Physicians present in the dataset are --> {}".format(train_iobp_df['OperatingPhysician'].nunique()))
Unique number of Operating Physicians present in the dataset are --> 35315
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df.groupby(['OperatingPhysician'])['ClaimID'].transform('count')
train_iobp_df['Opr_Phy_tot_claims'].describe()
count    114447.000000
mean         27.204811
std          52.687759
min           1.000000
25%           2.000000
50%           8.000000
75%          25.000000
max         424.000000
Name: Opr_Phy_tot_claims, dtype: float64
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Opr_Phy_tot_claims'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Opr_Phy_tot_claims'], color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nOperating Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,500,20), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Operating Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Opr_Phy_tot_claims', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,500,20), rotation=90, fontsize=12)
    plt.xlabel("\nOperating Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Operating Physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Opr_Phy_tot_claims may be useful in separating potentially fraudulent from non-fraudulent cases.
Att_Phy_tot_claims.
Adding New Feature :: Total Number of claims or cases seen by Other Physician¶
# Total unique number of Other Physicians
print("Unique number of Other Physicians present in the dataset are --> {}".format(train_iobp_df['OtherPhysician'].nunique()))
Unique number of Other Physicians present in the dataset are --> 46457
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df.groupby(['OtherPhysician'])['ClaimID'].transform('count')
train_iobp_df['Oth_Phy_tot_claims'].describe()
count    199736.000000
mean         90.207914
std         208.017235
min           1.000000
25%           3.000000
50%          15.000000
75%          60.000000
max        1247.000000
Name: Oth_Phy_tot_claims, dtype: float64
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Oth_Phy_tot_claims'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Oth_Phy_tot_claims'], color='green')
    # Providing the labels and title to the graph
    plt.xlabel("Other Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1450,50), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total claims filed by Other Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Oth_Phy_tot_claims', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1500,50), rotation=90, fontsize=12)
    plt.xlabel("Other Physician Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Other Physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Oth_Phy_tot_claims may be useful in separating potentially fraudulent from non-fraudulent cases.
Att_Phy_tot_claims.
# Simultaneously viewing the plots for better understanding
with plt.style.context('seaborn'):
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,8), sharey=False)
    sns.boxplot(data=train_iobp_df[train_iobp_df["PotentialFraud"] == 'Yes'][['Att_Phy_tot_claims','Opr_Phy_tot_claims','Oth_Phy_tot_claims']],
                ax=ax1, palette='viridis')
    ax1.set_title("Potential Fraud = Yes", fontdict=title_font_dict)
    ax1.set_xlabel("\nPhysicians Categories", fontdict=label_font_dict)
    ax1.set_ylabel("Claims Filed", fontdict=label_font_dict)
    sns.boxplot(data=train_iobp_df[train_iobp_df["PotentialFraud"] == 'No'][['Att_Phy_tot_claims','Opr_Phy_tot_claims','Oth_Phy_tot_claims']],
                ax=ax2, palette='magma')
    ax2.set_title("Potential Fraud = No", fontdict=title_font_dict)
    ax2.set_xlabel("\nPhysicians Categories", fontdict=label_font_dict)
    ax2.set_ylabel("Claims Filed", fontdict=label_font_dict)
    # Providing the title to the figure
    fig.suptitle("Distribution of Total Claims filed by Attending(Att), Operating(Opr) and Other(Oth) Physicians.\n", fontdict=title_font_dict)
    plt.minorticks_on()
    plt.plot();
OBSERVATION
Adding Combined Feature :: Att_Opr_Oth_Phy_Tot_Claims¶
It represents the total claims submitted by Attending, Operating and Other Physicians.
Reasoning :: The idea behind adding this feature is to see whether the combined claim volume of the physicians on a claim helps in identifying potential fraud.
Logic :: Att_Phy_tot_claims + Opr_Phy_tot_claims + Oth_Phy_tot_claims
# Claims without a given physician have NaN counts; treat them as 0 before summing.
# Assigning the result back avoids calling inplace fillna on a column selection.
train_iobp_df['Att_Phy_tot_claims'] = train_iobp_df['Att_Phy_tot_claims'].fillna(value=0)
train_iobp_df['Opr_Phy_tot_claims'] = train_iobp_df['Opr_Phy_tot_claims'].fillna(value=0)
train_iobp_df['Oth_Phy_tot_claims'] = train_iobp_df['Oth_Phy_tot_claims'].fillna(value=0)
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'] = train_iobp_df['Att_Phy_tot_claims'] + train_iobp_df['Opr_Phy_tot_claims'] + train_iobp_df['Oth_Phy_tot_claims']
train_iobp_df['Att_Opr_Oth_Phy_Tot_Claims'].describe()
count    558211.000000
mean        176.115666
std         379.833208
min           0.000000
25%           9.000000
50%          41.000000
75%         144.000000
max        3372.000000
Name: Att_Opr_Oth_Phy_Tot_Claims, dtype: float64
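The zero-filling before the addition matters because `+` propagates `NaN`. As a small sketch on made-up numbers (`toy_df`, `cols` and the three result names are hypothetical), a row-wise `sum(axis=1)` gives the same totals as filling with 0 first, without modifying the three source columns:

```python
import numpy as np
import pandas as pd

cols = ['Att_Phy_tot_claims', 'Opr_Phy_tot_claims', 'Oth_Phy_tot_claims']

# Toy per-claim physician counts; NaN means "no such physician on the claim".
toy_df = pd.DataFrame({
    'Att_Phy_tot_claims': [10.0, 4.0, np.nan],
    'Opr_Phy_tot_claims': [np.nan, 2.0, np.nan],
    'Oth_Phy_tot_claims': [5.0, np.nan, np.nan],
})

# Plain "+" propagates NaN, so every row with a missing count becomes NaN.
naive_total = toy_df[cols[0]] + toy_df[cols[1]] + toy_df[cols[2]]

# sum(axis=1) skips NaN by default, matching fillna(0) followed by "+".
row_total = toy_df[cols].sum(axis=1)
filled_total = toy_df[cols].fillna(0).sum(axis=1)

print(pd.DataFrame({'naive': naive_total, 'row_sum': row_total}))
```

A claim with no physician at all ends up with a total of 0, which is where the `min` of 0 in the `describe()` output above would come from.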
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Att_Opr_Oth_Phy_Tot_Claims'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Att_Opr_Oth_Phy_Tot_Claims'], color='green')
    # Providing the labels and title to the graph
    plt.xlabel("\nAttending, Operating & Other Physicians Total Claims Submitted", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Total Claims filed by Attending, Operating & Other Physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Att_Opr_Oth_Phy_Tot_Claims', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,100), rotation=90, fontsize=12)
    plt.xlabel("\nAttending, Operating & Other Physicians Total Claims", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Total claims filed by Attending, Operating & Other Physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
Adding 3 New Features :: Prv_Tot_Att_Phy, Prv_Tot_Opr_Phy and Prv_Tot_Oth_Phy¶
Reasoning :: The idea behind adding these features is to see whether a provider that has worked with very few or very many physicians is more or less likely to be a potential fraud.
# NOTE: transform('count') returns, per provider, the number of claims that carry a
# non-null physician ID of the given kind; it is not the number of distinct physicians.
train_iobp_df["Prv_Tot_Att_Phy"] = train_iobp_df.groupby(['Provider'])['AttendingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Opr_Phy"] = train_iobp_df.groupby(['Provider'])['OperatingPhysician'].transform('count')
train_iobp_df["Prv_Tot_Oth_Phy"] = train_iobp_df.groupby(['Provider'])['OtherPhysician'].transform('count')
# Nulls in the above features
train_iobp_df.isna().sum().tail(3)
Prv_Tot_Att_Phy    0
Prv_Tot_Opr_Phy    0
Prv_Tot_Oth_Phy    0
dtype: int64
train_iobp_df["Prv_Tot_Att_Phy"].describe()
count    558211.000000
mean        820.206469
std        1271.272090
min           1.000000
25%         122.000000
50%         359.000000
75%        1013.000000
max        8207.000000
Name: Prv_Tot_Att_Phy, dtype: float64
train_iobp_df["Prv_Tot_Opr_Phy"].describe()
count    558211.000000
mean        155.030023
std         228.266693
min           0.000000
25%          25.000000
50%          73.000000
75%         185.000000
max        1441.000000
Name: Prv_Tot_Opr_Phy, dtype: float64
train_iobp_df["Prv_Tot_Oth_Phy"].describe()
count    558211.000000
mean        306.781194
std         497.246984
min           0.000000
25%          37.000000
50%         120.000000
75%         381.000000
max        3209.000000
Name: Prv_Tot_Oth_Phy, dtype: float64
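One point worth keeping in mind when reading these three features: `transform('count')` counts rows with a non-null physician, so it behaves like a per-provider claim volume, whereas `transform('nunique')` would count distinct physicians. The sketch below contrasts the two on a made-up frame; `toy_df`, `claims_with_att_phy` and `distinct_att_phy` are hypothetical names, and the `nunique` variant is shown only as an alternative reading of the stated reasoning, not as what the notebook computes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_iobp_df (illustrative values only).
toy_df = pd.DataFrame({
    'Provider':           ['PRV1', 'PRV1', 'PRV1', 'PRV2', 'PRV2'],
    'AttendingPhysician': ['PHY1', 'PHY1', 'PHY2', 'PHY3', np.nan],
})

grouped = toy_df.groupby('Provider')['AttendingPhysician']

# 'count': number of claims per provider with a non-null attending physician.
toy_df['claims_with_att_phy'] = grouped.transform('count')

# 'nunique': number of distinct attending physicians per provider.
toy_df['distinct_att_phy'] = grouped.transform('nunique')

print(toy_df)
```

Here PRV1 has three claims but only two distinct physicians, so the two columns differ as soon as a physician appears on more than one claim.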
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Att_Phy'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Att_Phy'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many attending physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,8800,400), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with attending physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Att_Phy', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,8800,400), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many attending physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with attending physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If Prv_Tot_Att_Phy is high, then the chances of fraud are quite high.
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Opr_Phy'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Opr_Phy'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many operating physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with operating physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Opr_Phy', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many operating physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with operating physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If Prv_Tot_Opr_Phy is high, then the chances of fraud are quite high.
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Oth_Phy'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Oth_Phy'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many other physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,200), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with other physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Oth_Phy', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,3600,200), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many other physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with other physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If Prv_Tot_Oth_Phy is high, then the chances of fraud are quite high.
Adding Combined Feature :: Prv_Tot_Att_Opr_Oth_Phys¶
It represents the total of all kinds of physicians that a provider has interacted with.
Reasoning :: The idea behind adding this feature is to see whether a fraudulent provider interacts with a higher or lower number of physicians of all kinds.
Logic :: Prv_Tot_Att_Phy + Prv_Tot_Opr_Phy + Prv_Tot_Oth_Phy
train_iobp_df['Prv_Tot_Att_Opr_Oth_Phys'] = train_iobp_df['Prv_Tot_Att_Phy'] + train_iobp_df['Prv_Tot_Opr_Phy'] + train_iobp_df['Prv_Tot_Oth_Phy']
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['Prv_Tot_Att_Opr_Oth_Phys'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['Prv_Tot_Att_Opr_Oth_Phys'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders interacted with how many all kind of physicians?", fontdict=label_font_dict)
    plt.xticks(np.arange(0,14000,1000), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with all kind of physicians", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='Prv_Tot_Att_Opr_Oth_Phys', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,14000,1000), rotation=90, fontsize=12)
    plt.xlabel("\nProviders interacted with how many all kind of physicians?", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers interaction with all kind of physicians", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If Prv_Tot_Att_Opr_Oth_Phys is high, then the chances of fraud are quite high.
Adding New Feature :: Total Unique Claim Admit Codes used by a PROVIDER¶
Reasoning :: The idea behind adding this feature is to see how many unique Claim Admit Diagnosis Codes are used by the Provider.
train_iobp_df['PRV_Tot_Admit_DCodes'] = train_iobp_df.groupby(['Provider'])['ClmAdmitDiagnosisCode'].transform('nunique')
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_Admit_DCodes'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_Admit_DCodes'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders used unique Claim Admit Diagnosis Codes", fontdict=label_font_dict)
    plt.xticks(np.arange(0,600,50), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Claim Admit Diagnosis Codes", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_Admit_DCodes', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,600,50), rotation=90, fontsize=12)
    plt.xlabel("\nProviders used unique Claim Admit Diagnosis Codes", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Claim Admit Diagnosis Codes", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If PRV_Tot_Admit_DCodes is high, then the chances of fraud also increase.
NOTE :: What didn't work?
I also tried the unique number of Admit Diagnosis Codes used by the 3 different classes of physicians, but the variation was minimal, so those features were not added.
Adding New Feature :: Total Unique Number of Diagnosis Group Codes used by a PROVIDER¶
Reasoning :: The idea behind adding this feature is to see how many unique Diagnosis Group Codes are used by the Provider.
train_iobp_df['PRV_Tot_DGrpCodes'] = train_iobp_df.groupby(['Provider'])['DiagnosisGroupCode'].transform('nunique')
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_DGrpCodes'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_DGrpCodes'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nProviders used unique Diagnosis Group Codes", fontdict=label_font_dict)
    plt.xticks(np.arange(0,400,40), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Diagnosis Group Codes", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_DGrpCodes', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,400,40), rotation=90, fontsize=12)
    plt.xlabel("\nProviders used unique Diagnosis Group Codes", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Providers used unique Diagnosis Group Codes", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If PRV_Tot_DGrpCodes is high, then it slightly increases the chances of fraud.
NOTE :: What didn't work?
I also tried the unique number of Diagnosis Group Codes used by the 3 different classes of physicians, but the variation was minimal, so those features were not added.
NOTE :: What didn't work?
I also tried counting in how many claims each unique Diagnosis Group Code is used, but there was no variation at all, so that feature was not added. Kindly refer to the below image:
NOTE :: What didn't work?
I also tried DOB -- Month, DOD -- Year and DOD -- Month in order to see whether we can find some pattern of bogus DOB or DOD, but there was no variation at all, so those features were not added. The raw DOB -- Year also showed no variation. Kindly refer to the below image:
Adding New Feature :: Total unique Date of Birth years of beneficiaries provided by a Provider¶
Reasoning :: The idea behind adding this feature is that if a provider has very high variability in the year of birth of its patients, then that might be one of the signs of Medicare fraud.
train_iobp_df['DOB_Year'] = train_iobp_df['DOB'].dt.year
train_iobp_df['PRV_Tot_Unq_DOB_Years'] = train_iobp_df.groupby(['Provider'])['DOB_Year'].transform('nunique')
train_iobp_df['PRV_Tot_Unq_DOB_Years'].describe()
count    558211.000000
mean         50.615590
std          18.190988
min           1.000000
25%          38.000000
50%          54.000000
75%          67.000000
max          75.000000
Name: PRV_Tot_Unq_DOB_Years, dtype: float64
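The `.dt.year` accessor only works on a datetime column, so this step relies on `DOB` having been parsed earlier in the notebook (that step is outside this section). A minimal sketch of the whole chain on a made-up frame follows; `toy_df` and the `%Y-%m-%d` date format are assumptions made for the illustration.

```python
import pandas as pd

# Toy stand-in for train_iobp_df; DOB arrives as strings here (assumed format).
toy_df = pd.DataFrame({
    'Provider': ['PRV1', 'PRV1', 'PRV1', 'PRV2'],
    'DOB':      ['1943-01-01', '1943-06-01', '1936-09-01', '1950-03-01'],
})

# .dt.year needs datetime64 values, so parse the strings first.
toy_df['DOB'] = pd.to_datetime(toy_df['DOB'], format='%Y-%m-%d')
toy_df['DOB_Year'] = toy_df['DOB'].dt.year

# Number of distinct birth years among each provider's claims.
toy_df['PRV_Tot_Unq_DOB_Years'] = (
    toy_df.groupby('Provider')['DOB_Year'].transform('nunique')
)

print(toy_df)
```

Because the count of distinct years can only grow as a provider files more claims, this feature will tend to move together with claim volume.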
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_Unq_DOB_Years'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_Unq_DOB_Years'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nTotal unique Years of birth of patients", fontdict=label_font_dict)
    plt.xticks(np.arange(0,80,4), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution Providers treated Patients of various DOB Years", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_Unq_DOB_Years', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,80,4), rotation=90, fontsize=12)
    plt.xlabel("\nTotal unique Years of birth of patients", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution Providers treated Patients of various DOB Years", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If PRV_Tot_Unq_DOB_Years is very high, then it increases the chances of fraud as well.
train_iobp_df[train_iobp_df['PRV_Tot_Unq_DOB_Years'] >= 67]['PotentialFraud'].value_counts()
Yes    118873
No      22203
Name: PotentialFraud, dtype: int64
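The raw counts above are easier to interpret as a share. The short sketch below only redoes the arithmetic on the two numbers printed by `value_counts()`; the variable names are hypothetical.

```python
# Counts copied from the value_counts() output above.
fraud_claims = 118873
non_fraud_claims = 22203

total_claims = fraud_claims + non_fraud_claims
fraud_share = fraud_claims / total_claims

print(f"Claims with PRV_Tot_Unq_DOB_Years >= 67: {total_claims}")
print(f"Share labelled as potential fraud: {fraud_share:.1%}")
```

Roughly 84% of the claims in this top bucket belong to providers labelled as potential fraud. To judge how strong the signal is, that share would need to be compared with the overall share of fraud-labelled claims in `train_iobp_df`, which is not shown in this section.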
Adding New Feature :: Sum of patients age treated by a Provider¶
Reasoning :: The idea behind adding this feature is that there might be a pattern: if the sum of the ages of the patients treated by a provider is very high or very low, then it might be related to fraud.
train_iobp_df['PRV_Bene_Age_Sum'] = train_iobp_df.groupby(['Provider'])['Bene_Age'].transform('sum')
train_iobp_df['PRV_Bene_Age_Sum'].describe()
count    558211.000000
mean      60903.124044
std       95028.202759
min          34.300000
25%        9007.200000
50%       26310.800000
75%       74869.500000
max      617454.100000
Name: PRV_Bene_Age_Sum, dtype: float64
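A per-provider sum grows with the number of claims as well as with patient age, so it helps to look at the sum next to the mean and the claim count. The sketch below does this on a made-up frame; `toy_df`, `PRV_Bene_Age_Mean` and `PRV_Claim_Count` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for train_iobp_df: PRV1 has four claims, PRV2 has one.
toy_df = pd.DataFrame({
    'Provider': ['PRV1'] * 4 + ['PRV2'],
    'Bene_Age': [70.0, 72.0, 68.0, 70.0, 70.0],
})

by_provider = toy_df.groupby('Provider')['Bene_Age']

toy_df['PRV_Bene_Age_Sum'] = by_provider.transform('sum')
toy_df['PRV_Bene_Age_Mean'] = by_provider.transform('mean')   # hypothetical column
toy_df['PRV_Claim_Count'] = by_provider.transform('count')    # hypothetical column

print(toy_df.drop_duplicates('Provider'))
```

Both toy providers have the same mean patient age, yet their sums differ by a factor of four purely because of claim volume. The same caveat applies to any other per-provider sum.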
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Bene_Age_Sum'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Bene_Age_Sum'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nSum of patients age treated by Providers", fontdict=label_font_dict)
    plt.xticks(np.arange(0,620000,50000), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of Sum of patients age treated by Providers", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Bene_Age_Sum', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,620000,50000), rotation=90, fontsize=11)
    plt.xlabel("\nSum of patients age treated by Providers", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of patients age treated by Providers", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
If PRV_Bene_Age_Sum is high, then it increases the chances of fraud.
Adding New Feature :: Sum of Insc Claim Re-Imb Amount for a Provider¶
Reasoning :: The idea behind adding this feature is that there might be a pattern: if the sum of the claim re-imb amounts for a provider is very high or very low, then it might be related to fraud.
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'] = train_iobp_df.groupby(['Provider'])['InscClaimAmtReimbursed'].transform('sum')
train_iobp_df['PRV_Insc_Clm_ReImb_Amt'].describe()
count    5.582110e+05
mean     4.877429e+05
std      7.367223e+05
min      0.000000e+00
25%      6.369000e+04
50%      2.036000e+05
75%      5.969000e+05
max      5.996050e+06
Name: PRV_Insc_Clm_ReImb_Amt, dtype: float64
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Insc_Clm_ReImb_Amt'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Insc_Clm_ReImb_Amt'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nSum of Insc Claim Re-Imb Amount for a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of Insc Claim Re-Imb Amount for a Provider", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Insc_Clm_ReImb_Amt', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xlabel("\nSum of Insc Claim Re-Imb Amount for a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of Sum of Insc Claim Re-Imb Amount for a Provider", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
PRV_Insc_Clm_ReImb_Amt is high then it increases the chances of fraud.New Feature :: Total number of RKD Patients seen by a Provider¶Reasoning :: The idea behind adding this feature is that there might be a pattern like if the total number of RKD Patients seen by a Provider is very high or low then it might influence the fraud.train_iobp_df['RenalDiseaseIndicator'] = train_iobp_df['RenalDiseaseIndicator'].apply(lambda val: 1 if val == "Y" else 0)
train_iobp_df['PRV_Tot_RKD_Patients'] = train_iobp_df.groupby(['Provider'])['RenalDiseaseIndicator'].transform('sum')
train_iobp_df['PRV_Tot_RKD_Patients'].describe()
count    558211.000000
mean        157.902616
std         233.828365
min           0.000000
25%          24.000000
50%          73.000000
75%         192.000000
max        1447.000000
Name: PRV_Tot_RKD_Patients, dtype: float64
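Note that summing the indicator over claim rows counts claims filed for beneficiaries with the renal disease flag, so a beneficiary with several claims is counted several times. The sketch below, on toy data with hypothetical IDs, contrasts that claim-level sum with a count of distinct flagged beneficiaries per provider. The column name `PRV_Unique_RKD_Patients` and the "0" value used for non-flagged rows are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data: beneficiary B1 has two claims with the same provider.
claims = pd.DataFrame({
    "Provider": ["PRV_A", "PRV_A", "PRV_A", "PRV_B", "PRV_B"],
    "BeneID": ["B1", "B1", "B2", "B3", "B4"],
    "RenalDiseaseIndicator": ["Y", "Y", "0", "Y", "0"],
})

# Vectorised equivalent of the apply/lambda mapping: "Y" -> 1, anything else -> 0.
claims["RenalDiseaseIndicator"] = claims["RenalDiseaseIndicator"].eq("Y").astype(int)

# Claim-level sum, as in the notebook.
claims["PRV_Tot_RKD_Patients"] = (
    claims.groupby("Provider")["RenalDiseaseIndicator"].transform("sum")
)

# Distinct flagged beneficiaries per provider (hypothetical alternative).
unique_rkd = (
    claims[claims["RenalDiseaseIndicator"] == 1]
    .groupby("Provider")["BeneID"]
    .nunique()
)
claims["PRV_Unique_RKD_Patients"] = (
    claims["Provider"].map(unique_rkd).fillna(0).astype(int)
)

print(claims)
```

Either definition may be reasonable; the point is only that the two can diverge when beneficiaries have repeat claims.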
with plt.style.context('seaborn-poster'):
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'Yes']['PRV_Tot_RKD_Patients'], color='red')
    sns.kdeplot(x=train_iobp_df[train_iobp_df['PotentialFraud'] == 'No']['PRV_Tot_RKD_Patients'], color='blue')
    # Providing the labels and title to the graph
    plt.xlabel("\nRKD Patients seen by a Provider", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.minorticks_on()
    plt.title("Distribution of total number of RKD Patients seen by a Provider", fontdict=title_font_dict)
    plt.legend(labels=["Yes", "No"], title="Potential Fraud?");
with plt.style.context('seaborn-poster'):
    fig = sns.boxplot(data=train_iobp_df, y='PotentialFraud', x='PRV_Tot_RKD_Patients', palette='prism_r', orient='h')
    # Providing the labels and title to the graph
    plt.ylabel("Potentially Fraud?\n", fontdict=label_font_dict)
    plt.xticks(np.arange(0,1600,100), rotation=90, fontsize=11)
    plt.xlabel("\nRKD Patients seen by a Provider", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.title("Distribution of total number of RKD Patients seen by a Provider", fontdict=title_font_dict)
    plt.plot();
OBSERVATION
When PRV_Tot_RKD_Patients is high, the chances of fraud increase.
Let's find some trends
Providers with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['Provider','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['Provider', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | Provider | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | PRV51001 | No | 25 | 345415 | 0.01 |
| 1 | PRV51003 | Yes | 132 | 212796 | 0.06 |
| 2 | PRV51004 | No | 149 | 345415 | 0.04 |
| 3 | PRV51005 | Yes | 1165 | 212796 | 0.55 |
| 4 | PRV51007 | No | 72 | 345415 | 0.02 |
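The same share computation is repeated in the following sections for physicians, diagnosis codes, age groups, and states. As a sketch only, it could be wrapped in a helper such as the hypothetical `case_share_by` below, which counts claims per (column, label) pair and divides by the total for that label. It is shown on toy data with made-up IDs, and it uses `size()` (row count) where the notebook uses `['BeneID'].count()`; the two agree as long as BeneID has no missing values.

```python
import pandas as pd

def case_share_by(df, col, label_col="PotentialFraud"):
    """Count rows per (col, label) and express each count as a % of its label's total."""
    out = (
        df.groupby([col, label_col])
        .size()
        .reset_index(name="Num_of_cases")
        .rename(columns={label_col: "Fraud?"})
    )
    # Total number of cases per label, aligned back to every row.
    out["Cases"] = out.groupby("Fraud?")["Num_of_cases"].transform("sum")
    out["Percentage"] = (out["Num_of_cases"] / out["Cases"] * 100).round(2)
    return out

# Toy data; provider IDs and labels are illustrative only.
toy = pd.DataFrame({
    "Provider": ["PRV_A", "PRV_A", "PRV_A", "PRV_B", "PRV_C", "PRV_C"],
    "PotentialFraud": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})
shares = case_share_by(toy, "Provider")
print(shares)
```

By default `groupby` drops rows whose grouping key is missing, so the per-label totals in `Cases` shrink for columns with missing values. This is consistent with the smaller `Cases` totals in the physician tables later in this notebook compared with the provider table above.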
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['Provider','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="Provider", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent Providers", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 Providers with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | Provider | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PRV51459 | 8240 | 3.87 |
| 1 | PRV53797 | 4739 | 2.23 |
| 2 | PRV51574 | 4444 | 2.09 |
| 3 | PRV53918 | 3588 | 1.69 |
| 4 | PRV54895 | 3436 | 1.61 |
| 5 | PRV55215 | 3393 | 1.59 |
| 6 | PRV52064 | 2844 | 1.34 |
| 7 | PRV56011 | 2833 | 1.33 |
| 8 | PRV55004 | 2399 | 1.13 |
| 9 | PRV56560 | 2313 | 1.09 |
| 10 | PRV57306 | 2315 | 1.09 |
| 11 | PRV52030 | 2275 | 1.07 |
| 12 | PRV52649 | 2156 | 1.01 |
| 13 | PRV54772 | 2115 | 0.99 |
| 14 | PRV52628 | 2098 | 0.99 |
| 15 | PRV51369 | 2083 | 0.98 |
| 16 | PRV51347 | 2067 | 0.97 |
| 17 | PRV55039 | 2058 | 0.97 |
| 18 | PRV57103 | 2049 | 0.96 |
| 19 | PRV52019 | 1961 | 0.92 |
| 20 | PRV51480 | 1924 | 0.90 |
| 21 | PRV55462 | 1907 | 0.90 |
| 22 | PRV52041 | 1885 | 0.89 |
| 23 | PRV55467 | 1896 | 0.89 |
| 24 | PRV54742 | 1892 | 0.89 |
OBSERVATION
Providers with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['Provider','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="Provider", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent Providers", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 Providers with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | Provider | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PRV53750 | 1245 | 0.36 |
| 1 | PRV55552 | 1206 | 0.35 |
| 2 | PRV53394 | 1215 | 0.35 |
| 3 | PRV53871 | 1220 | 0.35 |
| 4 | PRV52001 | 1177 | 0.34 |
| 5 | PRV52104 | 1189 | 0.34 |
| 6 | PRV56559 | 1113 | 0.32 |
| 7 | PRV56006 | 1090 | 0.32 |
| 8 | PRV54813 | 1056 | 0.31 |
| 9 | PRV52631 | 1057 | 0.31 |
| 10 | PRV51509 | 1030 | 0.30 |
| 11 | PRV56270 | 1043 | 0.30 |
| 12 | PRV54332 | 995 | 0.29 |
| 13 | PRV52605 | 1000 | 0.29 |
| 14 | PRV53702 | 1010 | 0.29 |
| 15 | PRV57333 | 1004 | 0.29 |
| 16 | PRV55525 | 933 | 0.27 |
| 17 | PRV56248 | 934 | 0.27 |
| 18 | PRV57348 | 920 | 0.27 |
| 19 | PRV52395 | 912 | 0.26 |
| 20 | PRV56243 | 909 | 0.26 |
| 21 | PRV52859 | 911 | 0.26 |
| 22 | PRV53700 | 913 | 0.26 |
| 23 | PRV55510 | 861 | 0.25 |
| 24 | PRV54374 | 854 | 0.25 |
OBSERVATION
Attending Physicians with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['AttendingPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['AttendingPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | AttendingPhysician | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | PHY311001 | No | 2 | 344471 | 0.0 |
| 1 | PHY311002 | Yes | 1 | 212232 | 0.0 |
| 2 | PHY311004 | No | 2 | 344471 | 0.0 |
| 3 | PHY311005 | No | 2 | 344471 | 0.0 |
| 4 | PHY311006 | No | 1 | 344471 | 0.0 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['AttendingPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="AttendingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent AttendingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 AttendingPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | AttendingPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY330576 | 2534 | 1.19 |
| 1 | PHY350277 | 1628 | 0.77 |
| 2 | PHY412132 | 1321 | 0.62 |
| 3 | PHY423534 | 1223 | 0.58 |
| 4 | PHY314027 | 1200 | 0.57 |
| 5 | PHY327046 | 1181 | 0.56 |
| 6 | PHY338032 | 1158 | 0.55 |
| 7 | PHY357120 | 1156 | 0.54 |
| 8 | PHY337425 | 1156 | 0.54 |
| 9 | PHY341578 | 1133 | 0.53 |
| 10 | PHY432650 | 1093 | 0.52 |
| 11 | PHY347064 | 1076 | 0.51 |
| 12 | PHY344389 | 1000 | 0.47 |
| 13 | PHY383481 | 1005 | 0.47 |
| 14 | PHY415321 | 1002 | 0.47 |
| 15 | PHY433436 | 924 | 0.44 |
| 16 | PHY375453 | 880 | 0.41 |
| 17 | PHY387126 | 762 | 0.36 |
| 18 | PHY357307 | 737 | 0.35 |
| 19 | PHY318667 | 711 | 0.34 |
| 20 | PHY424712 | 693 | 0.33 |
| 21 | PHY347780 | 678 | 0.32 |
| 22 | PHY313278 | 674 | 0.32 |
| 23 | PHY323447 | 664 | 0.31 |
| 24 | PHY333735 | 634 | 0.30 |
OBSERVATION
Attending Physicians with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['AttendingPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="AttendingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent AttendingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 AttendingPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | AttendingPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY351121 | 1053 | 0.31 |
| 1 | PHY375943 | 912 | 0.26 |
| 2 | PHY432614 | 716 | 0.21 |
| 3 | PHY389456 | 673 | 0.20 |
| 4 | PHY326984 | 686 | 0.20 |
| 5 | PHY362889 | 674 | 0.20 |
| 6 | PHY373032 | 618 | 0.18 |
| 7 | PHY367255 | 634 | 0.18 |
| 8 | PHY356444 | 600 | 0.17 |
| 9 | PHY360179 | 583 | 0.17 |
| 10 | PHY405720 | 544 | 0.16 |
| 11 | PHY342223 | 503 | 0.15 |
| 12 | PHY387900 | 514 | 0.15 |
| 13 | PHY430054 | 513 | 0.15 |
| 14 | PHY361063 | 470 | 0.14 |
| 15 | PHY326049 | 445 | 0.13 |
| 16 | PHY388040 | 447 | 0.13 |
| 17 | PHY351973 | 426 | 0.12 |
| 18 | PHY318242 | 422 | 0.12 |
| 19 | PHY328307 | 402 | 0.12 |
| 20 | PHY403755 | 429 | 0.12 |
| 21 | PHY322775 | 372 | 0.11 |
| 22 | PHY374226 | 363 | 0.11 |
| 23 | PHY424939 | 347 | 0.10 |
| 24 | PHY366914 | 333 | 0.10 |
OBSERVATION
Operating Physicians with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['OperatingPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['OperatingPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | OperatingPhysician | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | PHY311005 | No | 1 | 67497 | 0.00 |
| 1 | PHY311010 | No | 1 | 67497 | 0.00 |
| 2 | PHY311011 | Yes | 5 | 46950 | 0.01 |
| 3 | PHY311014 | No | 3 | 67497 | 0.00 |
| 4 | PHY311018 | No | 2 | 67497 | 0.00 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['OperatingPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="OperatingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent OperatingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OperatingPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | OperatingPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY330576 | 424 | 0.90 |
| 1 | PHY424897 | 293 | 0.62 |
| 2 | PHY314027 | 256 | 0.55 |
| 3 | PHY423534 | 250 | 0.53 |
| 4 | PHY357120 | 249 | 0.53 |
| 5 | PHY412132 | 245 | 0.52 |
| 6 | PHY327046 | 236 | 0.50 |
| 7 | PHY381249 | 231 | 0.49 |
| 8 | PHY333735 | 232 | 0.49 |
| 9 | PHY341578 | 224 | 0.48 |
| 10 | PHY429430 | 225 | 0.48 |
| 11 | PHY337425 | 226 | 0.48 |
| 12 | PHY383481 | 191 | 0.41 |
| 13 | PHY347064 | 189 | 0.40 |
| 14 | PHY432650 | 178 | 0.38 |
| 15 | PHY344389 | 171 | 0.36 |
| 16 | PHY415321 | 165 | 0.35 |
| 17 | PHY433436 | 159 | 0.34 |
| 18 | PHY341560 | 154 | 0.33 |
| 19 | PHY387026 | 143 | 0.30 |
| 20 | PHY387126 | 138 | 0.29 |
| 21 | PHY357307 | 127 | 0.27 |
| 22 | PHY404832 | 127 | 0.27 |
| 23 | PHY347780 | 121 | 0.26 |
| 24 | PHY411541 | 121 | 0.26 |
OBSERVATION
Operating Physicians with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['OperatingPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="OperatingPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent OperatingPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OperatingPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | OperatingPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY387900 | 180 | 0.27 |
| 1 | PHY351121 | 179 | 0.27 |
| 2 | PHY375943 | 147 | 0.22 |
| 3 | PHY367255 | 132 | 0.20 |
| 4 | PHY432614 | 129 | 0.19 |
| 5 | PHY362889 | 122 | 0.18 |
| 6 | PHY326984 | 121 | 0.18 |
| 7 | PHY356444 | 108 | 0.16 |
| 8 | PHY360179 | 98 | 0.15 |
| 9 | PHY405720 | 96 | 0.14 |
| 10 | PHY373032 | 97 | 0.14 |
| 11 | PHY321493 | 88 | 0.13 |
| 12 | PHY342223 | 89 | 0.13 |
| 13 | PHY318242 | 85 | 0.13 |
| 14 | PHY319973 | 89 | 0.13 |
| 15 | PHY326049 | 81 | 0.12 |
| 16 | PHY366914 | 79 | 0.12 |
| 17 | PHY361063 | 84 | 0.12 |
| 18 | PHY388040 | 82 | 0.12 |
| 19 | PHY322775 | 71 | 0.11 |
| 20 | PHY313705 | 77 | 0.11 |
| 21 | PHY403755 | 77 | 0.11 |
| 22 | PHY351973 | 69 | 0.10 |
| 23 | PHY405650 | 69 | 0.10 |
| 24 | PHY415621 | 68 | 0.10 |
OBSERVATION
Other Physicians with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['OtherPhysician','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['OtherPhysician', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | OtherPhysician | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | PHY311001 | No | 1 | 125093 | 0.0 |
| 1 | PHY311003 | Yes | 2 | 74643 | 0.0 |
| 2 | PHY311005 | No | 2 | 125093 | 0.0 |
| 3 | PHY311006 | No | 3 | 125093 | 0.0 |
| 4 | PHY311007 | Yes | 1 | 74643 | 0.0 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['OtherPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="OtherPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent OtherPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OtherPhysician with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | OtherPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY412132 | 1247 | 1.67 |
| 1 | PHY341578 | 1098 | 1.47 |
| 2 | PHY338032 | 1070 | 1.43 |
| 3 | PHY337425 | 1041 | 1.39 |
| 4 | PHY347064 | 806 | 1.08 |
| 5 | PHY322092 | 771 | 1.03 |
| 6 | PHY409965 | 744 | 1.00 |
| 7 | PHY313818 | 730 | 0.98 |
| 8 | PHY350277 | 682 | 0.91 |
| 9 | PHY415321 | 678 | 0.91 |
| 10 | PHY313278 | 625 | 0.84 |
| 11 | PHY359122 | 614 | 0.82 |
| 12 | PHY416093 | 538 | 0.72 |
| 13 | PHY333735 | 496 | 0.66 |
| 14 | PHY421058 | 432 | 0.58 |
| 15 | PHY359931 | 414 | 0.55 |
| 16 | PHY396637 | 400 | 0.54 |
| 17 | PHY327964 | 371 | 0.50 |
| 18 | PHY336024 | 372 | 0.50 |
| 19 | PHY315344 | 363 | 0.49 |
| 20 | PHY410597 | 357 | 0.48 |
| 21 | PHY416732 | 356 | 0.48 |
| 22 | PHY383336 | 344 | 0.46 |
| 23 | PHY325906 | 293 | 0.39 |
| 24 | PHY356051 | 283 | 0.38 |
OBSERVATION
Other Physicians with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['OtherPhysician','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="OtherPhysician", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent OtherPhysician", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 OtherPhysician with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | OtherPhysician | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | PHY422235 | 369 | 0.29 |
| 1 | PHY387900 | 351 | 0.28 |
| 2 | PHY375943 | 328 | 0.26 |
| 3 | PHY411722 | 313 | 0.25 |
| 4 | PHY363309 | 305 | 0.24 |
| 5 | PHY331484 | 295 | 0.24 |
| 6 | PHY362889 | 294 | 0.24 |
| 7 | PHY326984 | 292 | 0.23 |
| 8 | PHY367255 | 269 | 0.22 |
| 9 | PHY389456 | 271 | 0.22 |
| 10 | PHY356444 | 258 | 0.21 |
| 11 | PHY382485 | 249 | 0.20 |
| 12 | PHY315424 | 243 | 0.19 |
| 13 | PHY405720 | 220 | 0.18 |
| 14 | PHY354358 | 217 | 0.17 |
| 15 | PHY323501 | 202 | 0.16 |
| 16 | PHY403755 | 196 | 0.16 |
| 17 | PHY384394 | 196 | 0.16 |
| 18 | PHY388477 | 197 | 0.16 |
| 19 | PHY395822 | 195 | 0.16 |
| 20 | PHY326286 | 189 | 0.15 |
| 21 | PHY415621 | 182 | 0.15 |
| 22 | PHY411662 | 189 | 0.15 |
| 23 | PHY359164 | 182 | 0.15 |
| 24 | PHY342223 | 184 | 0.15 |
OBSERVATION
ClmAdmitDiagnosisCode with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['ClmAdmitDiagnosisCode','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['ClmAdmitDiagnosisCode', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | ClmAdmitDiagnosisCode | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | 0030 | No | 1 | 83850 | 0.00 |
| 1 | 0030 | Yes | 1 | 62049 | 0.00 |
| 2 | 0059 | No | 1 | 83850 | 0.00 |
| 3 | 0059 | Yes | 2 | 62049 | 0.00 |
| 4 | 00845 | No | 31 | 83850 | 0.04 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['ClmAdmitDiagnosisCode','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="ClmAdmitDiagnosisCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent ClmAdmitDiagnosisCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 ClmAdmitDiagnosisCode with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
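| | ClmAdmitDiagnosisCode | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | 42731 | 1529 | 2.46 |
| 1 | V7612 | 1441 | 2.32 |
| 2 | 78605 | 1432 | 2.31 |
| 3 | 78650 | 1191 | 1.92 |
| 4 | 78900 | 1020 | 1.64 |
| 5 | 4019 | 1006 | 1.62 |
| 6 | 25000 | 873 | 1.41 |
| 7 | 486 | 843 | 1.36 |
| 8 | 78079 | 828 | 1.33 |
| 9 | 7802 | 793 | 1.28 |
| 10 | 7295 | 740 | 1.19 |
| 11 | 5990 | 703 | 1.13 |
| 12 | V5883 | 691 | 1.11 |
| 13 | 4280 | 661 | 1.07 |
| 14 | 7242 | 600 | 0.97 |
| 15 | 7862 | 593 | 0.96 |
| 16 | V5789 | 573 | 0.92 |
| 17 | V5861 | 556 | 0.90 |
| 18 | 2724 | 549 | 0.88 |
| 19 | V571 | 538 | 0.87 |
| 20 | 78097 | 536 | 0.86 |
| 21 | 78609 | 521 | 0.84 |
| 22 | 41401 | 507 | 0.82 |
| 23 | 78659 | 435 | 0.70 |
| 24 | 7804 | 417 | 0.67 |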
OBSERVATION
ClmAdmitDiagnosisCode with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['ClmAdmitDiagnosisCode','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="ClmAdmitDiagnosisCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent ClmAdmitDiagnosisCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 ClmAdmitDiagnosisCode with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
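| | ClmAdmitDiagnosisCode | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | V7612 | 2633 | 3.14 |
| 1 | 42731 | 2105 | 2.51 |
| 2 | 4019 | 1726 | 2.06 |
| 3 | 78605 | 1560 | 1.86 |
| 4 | 25000 | 1495 | 1.78 |
| 5 | 78900 | 1316 | 1.57 |
| 6 | V5883 | 1182 | 1.41 |
| 7 | 7295 | 1104 | 1.32 |
| 8 | 78650 | 1082 | 1.29 |
| 9 | 7242 | 997 | 1.19 |
| 10 | V5861 | 980 | 1.17 |
| 11 | 2724 | 958 | 1.14 |
| 12 | 78079 | 951 | 1.13 |
| 13 | 5990 | 922 | 1.10 |
| 14 | V571 | 905 | 1.08 |
| 15 | 7862 | 868 | 1.04 |
| 16 | 7802 | 820 | 0.98 |
| 17 | 4011 | 660 | 0.79 |
| 18 | 486 | 621 | 0.74 |
| 19 | 7804 | 622 | 0.74 |
| 20 | 41401 | 613 | 0.73 |
| 21 | 7245 | 596 | 0.71 |
| 22 | 78609 | 598 | 0.71 |
| 23 | 71946 | 573 | 0.68 |
| 24 | 7840 | 551 | 0.66 |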
OBSERVATION
The above plot shows the top 25 Claim Admit Diagnosis Codes by share of non-fraudulent claims.
The main observation from the two plots above is that the same Claim Admit Diagnosis Codes have similar percentages for fraudulent and non-fraudulent claims. This feature may therefore not be very useful.
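One way to check this observation numerically rather than by eye is to place the two percentage columns side by side and look at the gap for each code. The sketch below does this for four codes whose counts are copied from the two printed top-25 lists above, with the per-label totals taken from the `Cases` column; the `gap` column name is an illustrative choice.

```python
import pandas as pd

# Counts copied from the printed top-25 lists; totals from the 'Cases' column.
tmp = pd.DataFrame({
    "ClmAdmitDiagnosisCode": ["42731", "42731", "V7612", "V7612",
                              "78605", "78605", "486", "486"],
    "Fraud?": ["Yes", "No"] * 4,
    "Num_of_cases": [1529, 2105, 1441, 2633, 1432, 1560, 843, 621],
})
totals = {"Yes": 62049, "No": 83850}
tmp["Percentage"] = (tmp["Num_of_cases"] / tmp["Fraud?"].map(totals) * 100).round(2)

# One row per code, one column per label, plus the difference in percentage points.
wide = tmp.pivot(index="ClmAdmitDiagnosisCode", columns="Fraud?", values="Percentage")
wide["gap"] = (wide["Yes"] - wide["No"]).round(2)
print(wide.sort_values("gap"))
```

For these four codes every gap is below one percentage point, in line with the observation, although codes such as V7612 and 486 differ more than the others.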
DiagnosisGroupCode with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['DiagnosisGroupCode','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['DiagnosisGroupCode', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | DiagnosisGroupCode | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | 000 | No | 57 | 17072 | 0.33 |
| 1 | 000 | Yes | 77 | 23402 | 0.33 |
| 2 | 001 | No | 2 | 17072 | 0.01 |
| 3 | 001 | Yes | 8 | 23402 | 0.03 |
| 4 | 002 | No | 9 | 17072 | 0.05 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['DiagnosisGroupCode','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="DiagnosisGroupCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent DiagnosisGroupCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 DiagnosisGroupCode with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
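| | DiagnosisGroupCode | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | 882 | 111 | 0.47 |
| 1 | 166 | 107 | 0.46 |
| 2 | 186 | 102 | 0.44 |
| 3 | 192 | 98 | 0.42 |
| 4 | 945 | 95 | 0.41 |
| 5 | 884 | 97 | 0.41 |
| 6 | 939 | 96 | 0.41 |
| 7 | 202 | 95 | 0.41 |
| 8 | 883 | 96 | 0.41 |
| 9 | 188 | 97 | 0.41 |
| 10 | 168 | 94 | 0.40 |
| 11 | 949 | 94 | 0.40 |
| 12 | 204 | 94 | 0.40 |
| 13 | 885 | 92 | 0.39 |
| 14 | 876 | 90 | 0.38 |
| 15 | 196 | 89 | 0.38 |
| 16 | 198 | 89 | 0.38 |
| 17 | 183 | 89 | 0.38 |
| 18 | 887 | 90 | 0.38 |
| 19 | 950 | 90 | 0.38 |
| 20 | 177 | 86 | 0.37 |
| 21 | 946 | 87 | 0.37 |
| 22 | 947 | 86 | 0.37 |
| 23 | 164 | 87 | 0.37 |
| 24 | 184 | 84 | 0.36 |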
OBSERVATION
DiagnosisGroupCode with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['DiagnosisGroupCode','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="DiagnosisGroupCode", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent DiagnosisGroupCode", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 DiagnosisGroupCode with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
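| | DiagnosisGroupCode | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | 183 | 76 | 0.45 |
| 1 | 167 | 76 | 0.45 |
| 2 | 884 | 77 | 0.45 |
| 3 | 208 | 75 | 0.44 |
| 4 | 940 | 70 | 0.41 |
| 5 | 881 | 70 | 0.41 |
| 6 | 882 | 68 | 0.40 |
| 7 | 187 | 69 | 0.40 |
| 8 | 887 | 69 | 0.40 |
| 9 | 941 | 69 | 0.40 |
| 10 | 939 | 67 | 0.39 |
| 11 | 876 | 66 | 0.39 |
| 12 | 168 | 66 | 0.39 |
| 13 | 182 | 67 | 0.39 |
| 14 | 184 | 65 | 0.38 |
| 15 | 205 | 65 | 0.38 |
| 16 | 206 | 65 | 0.38 |
| 17 | 177 | 64 | 0.37 |
| 18 | 190 | 63 | 0.37 |
| 19 | 283 | 64 | 0.37 |
| 20 | 198 | 63 | 0.37 |
| 21 | 204 | 64 | 0.37 |
| 22 | 180 | 64 | 0.37 |
| 23 | 948 | 64 | 0.37 |
| 24 | 880 | 63 | 0.37 |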
OBSERVATION
The above plot shows the top 25 Diagnosis Group Codes by share of non-fraudulent claims.
The main observation from the two plots above is that the same Diagnosis Group Codes have similar percentages for fraudulent and non-fraudulent claims. This feature may therefore not be very useful.
Age_groups have any relationship with maximum number of fraudulent cases?¶
def bene_age_brackets(val):
    """
    Description : This function allocates an age group based on the Beneficiary Age.
    Note : any value that does not fall in 1-80 (including values below 1) is returned as 'Very Old'.
    """
    if val >= 1 and val <= 40:
        return 'Young'
    elif val > 40 and val <= 60:
        return 'Mid'
    elif val > 60 and val <= 80:
        return 'Old'
    else:
        return 'Very Old'

train_iobp_df['AGE_groups'] = train_iobp_df['Bene_Age'].apply(bene_age_brackets)
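The same bucketing can also be expressed with `pd.cut`, shown here as a sketch on a few made-up whole-number ages. One difference worth flagging: `bene_age_brackets` sends anything outside 1 to 80, including 0 or a missing age, to 'Very Old', whereas `pd.cut` leaves values outside the bin edges as NaN. The upper edge of 200 is an arbitrary assumption.

```python
import pandas as pd

# Made-up ages chosen to sit on both sides of each bin boundary.
ages = pd.Series([26, 40, 41, 60, 61, 80, 81, 100])

# Right-closed bins: (0, 40], (40, 60], (60, 80], (80, 200]
age_groups = pd.cut(
    ages,
    bins=[0, 40, 60, 80, 200],
    labels=["Young", "Mid", "Old", "Very Old"],
)
print(age_groups.tolist())
```

Whether out-of-range ages should become NaN or be assigned to a bucket is a modelling choice; the sketch only makes the boundary behaviour explicit.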
tmp = pd.DataFrame(train_iobp_df.groupby(['AGE_groups','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['AGE_groups', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | AGE_groups | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | Mid | No | 35524 | 345415 | 10.28 |
| 1 | Mid | Yes | 21152 | 212796 | 9.94 |
| 2 | Old | No | 190334 | 345415 | 55.10 |
| 3 | Old | Yes | 116676 | 212796 | 54.83 |
| 4 | Very Old | No | 110885 | 345415 | 32.10 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['AGE_groups','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_frauds, x="AGE_groups", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent AGE_groups", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("AGE_groups with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
| | AGE_groups | Num_of_cases | Percentage |
|---|---|---|---|
| 0 | Old | 116676 | 54.83 |
| 1 | Very Old | 69780 | 32.79 |
| 2 | Mid | 21152 | 9.94 |
| 3 | Young | 5188 | 2.44 |
OBSERVATION
Age_groups have any relationship with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['AGE_groups','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_non_frauds, x="AGE_groups", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent AGE_groups", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("AGE_groups with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
  AGE_groups  Num_of_cases  Percentage
0        Old        190334       55.10
1   Very Old        110885       32.10
2        Mid         35524       10.28
3      Young          8672        2.51
OBSERVATION
The above plot shows the percentage of non-fraudulent case submissions for each Age Group.
The main observation from the above 2 plots is that each Age Group accounts for a similar share of fraudulent and non-fraudulent claims (for example, Old: 54.83% vs 55.10%, Very Old: 32.79% vs 32.10%). Therefore, it feels like this feature on its own might not be very useful for separating the two classes.
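The comparison of the two plots can also be expressed as a number. The sketch below is not part of the original analysis: it re-enters the case counts from the two printed tables and computes, for each Age Group, the gap in percentage points between its share of fraudulent and of non-fraudulent cases. The names `class_shares` and `share_gap` are illustrative.

```python
# Case counts per age group, copied from the two printed tables above.
fraud_counts = {"Old": 116676, "Very Old": 69780, "Mid": 21152, "Young": 5188}
non_fraud_counts = {"Old": 190334, "Very Old": 110885, "Mid": 35524, "Young": 8672}


def class_shares(counts):
    """Return each group's percentage share of the class total."""
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


fraud_share = class_shares(fraud_counts)
non_fraud_share = class_shares(non_fraud_counts)

# Absolute gap, in percentage points, between the two class-wise shares.
share_gap = {g: abs(fraud_share[g] - non_fraud_share[g]) for g in fraud_counts}

# Print the groups from the largest gap to the smallest.
for group, gap in sorted(share_gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:<9} fraud={fraud_share[group]:6.2f}%  "
          f"non-fraud={non_fraud_share[group]:6.2f}%  gap={gap:.2f} pp")
```

A small gap for every group supports the reading that the Age Group distribution is nearly the same in both classes.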
States with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['State','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['State', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | State | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | 1 | No | 6715 | 345415 | 1.94 |
| 1 | 1 | Yes | 3525 | 212796 | 1.66 |
| 2 | 2 | No | 531 | 345415 | 0.15 |
| 3 | 2 | Yes | 207 | 212796 | 0.10 |
| 4 | 3 | No | 7314 | 345415 | 2.12 |
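The same group-count-percentage steps are repeated for every feature in this section, so they could be wrapped in one helper. The following is a minimal sketch under that assumption; `class_share_table` and the `toy` frame are hypothetical and only the column names `PotentialFraud` and `BeneID` come from the notebook.

```python
import pandas as pd


def class_share_table(df, feature, label="PotentialFraud", id_col="BeneID"):
    """Count rows per (feature, label) pair and express each count as a
    percentage of its label's total, mirroring the tmp tables above."""
    out = (df.groupby([feature, label])[id_col]
             .count()
             .reset_index(name="Num_of_cases"))
    # Total number of cases in each label class (fraud / non-fraud).
    out["Cases"] = out.groupby(label)["Num_of_cases"].transform("sum")
    out["Percentage"] = (out["Num_of_cases"] / out["Cases"] * 100).round(2)
    return out.rename(columns={label: "Fraud?"})


# Tiny made-up frame, only to show the call; this is not real claim data.
toy = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3", "B4", "B5", "B6"],
    "State": [1, 1, 2, 2, 2, 1],
    "PotentialFraud": ["Yes", "No", "No", "Yes", "Yes", "No"],
})
toy_tmp = class_share_table(toy, "State")
print(toy_tmp)
```

With such a helper, a call like `class_share_table(train_iobp_df, 'State')` would be expected to yield the same columns as the table above, without the intermediate `tot_fraud_cases` and `tot_non_fraud_cases` variables.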
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['State','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="State", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent State Codes", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 State Codes with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
    State  Num_of_cases  Percentage
0       5         30335       14.26
1      10         17512        8.23
2      33         17492        8.22
3      39         11448        5.38
4      45         10135        4.76
5      31          9112        4.28
6      49          8613        4.05
7      23          8538        4.01
8      14          8509        4.00
9      22          7798        3.66
10     44          6709        3.15
11     36          6381        3.00
12     26          5301        2.49
13     50          4782        2.25
14     15          4635        2.18
15     34          4385        2.06
16     11          4123        1.94
17      6          3666        1.72
18      1          3525        1.66
19     24          3453        1.62
20     42          3180        1.49
21     16          2733        1.28
22     21          2576        1.21
23     46          2124        1.00
24      3          2030        0.95
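OBSERVATION
The above plot shows the Top-25 State Codes with the highest percentage of fraudulent case submissions; State 5 alone accounts for 14.26% of them.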
States with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['State','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(14,8))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="State", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent State Codes", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 State Codes with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
    State  Num_of_cases  Percentage
0      45         23887        6.92
1      10         21561        6.24
2       5         21015        6.08
3      33         17532        5.08
4      14         15908        4.61
5      36         14910        4.32
6      34         14520        4.20
7      11         12880        3.73
8      39         12803        3.71
9      23         12805        3.71
10     15          9578        2.77
11     21          8685        2.51
12     18          8643        2.50
13     52          7839        2.27
14     44          7709        2.23
15     26          7610        2.20
16     42          7311        2.12
17      3          7314        2.12
18     50          6958        2.01
19     31          6828        1.98
20      1          6715        1.94
21     49          6384        1.85
22     19          6230        1.80
23     25          5863        1.70
24     22          5826        1.69
OBSERVATION
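The above plot shows the Top-25 State Codes with the highest percentage of non-fraudulent case submissions.
The main observation from the above 2 plots is that most State Codes have similar percentages for fraudulent and non-fraudulent claims, although a few differ noticeably (for example, State 5: 14.26% vs 6.08%, State 45: 4.76% vs 6.92%). Therefore, it feels like this feature might not be very useful overall, but those State Codes may deserve a closer look.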
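Another way to read the same counts is the fraud rate inside each State Code, compared with the overall rate. This is a hedged sketch, not the notebook's method: the counts for four State Codes are re-entered from the two printed tables, and `state_fraud_rate` and `baseline` are illustrative names. Note that `PotentialFraud` is a provider-level label, so the "rate" here is the share of a state's claims that belong to flagged providers.

```python
# (fraudulent, non-fraudulent) case counts for four State Codes,
# copied from the two printed Top-25 tables above.
state_counts = {
    5: (30335, 21015),
    10: (17512, 21561),
    33: (17492, 17532),
    45: (10135, 23887),
}

# Class totals as shown in the "Cases" column of the tmp table.
total_fraud, total_non_fraud = 212796, 345415

# Overall share of cases labelled as potential fraud.
baseline = total_fraud / (total_fraud + total_non_fraud)

# Share of each state's cases that are labelled as potential fraud.
state_fraud_rate = {
    state: fraud / (fraud + non_fraud)
    for state, (fraud, non_fraud) in state_counts.items()
}

for state, rate in sorted(state_fraud_rate.items(), key=lambda kv: kv[1], reverse=True):
    direction = "above" if rate > baseline else "below"
    print(f"State {state:>2}: {rate:.1%} ({direction} the overall {baseline:.1%})")
```

A State Code whose rate sits far from the overall rate is one where the two class-wise percentages in the plots above also differ.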
County with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['County','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['County', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | County | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | 0 | No | 6584 | 345415 | 1.91 |
| 1 | 0 | Yes | 4897 | 212796 | 2.30 |
| 2 | 1 | No | 8 | 345415 | 0.00 |
| 3 | 1 | Yes | 4 | 212796 | 0.00 |
| 4 | 10 | No | 10750 | 345415 | 3.11 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['County','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(15,10))
    fig = sns.barplot(data=tmp_only_frauds.iloc[0:25], x="County", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent County Codes", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 County Codes with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
    County  Num_of_cases  Percentage
0      200         10078        4.74
1      470          7048        3.31
2      400          5962        2.80
3      590          5814        2.73
4        0          4897        2.30
5      160          4865        2.29
6      620          4704        2.21
7      130          4551        2.14
8      490          4477        2.10
9      170          4362        2.05
10     440          4314        2.03
11      20          4174        1.96
12      90          4079        1.92
13     150          3844        1.81
14     290          3725        1.75
15     510          3505        1.65
16     310          3437        1.62
17     390          3387        1.59
18     331          3284        1.54
19      10          3232        1.52
20     141          3202        1.50
21      60          3181        1.49
22     700          3020        1.42
23     250          2999        1.41
24     530          2871        1.35
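OBSERVATION
The above plot shows the Top-25 County Codes with the highest percentage of fraudulent case submissions; County 200 leads with 4.74% of them.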
County with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['County','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(15,10))
    fig = sns.barplot(data=tmp_only_non_frauds.iloc[0:25], x="County", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=90)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent County Codes", fontdict=label_font_dict)
    plt.xticks(rotation=90, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Top-25 County Codes with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
    County  Num_of_cases  Percentage
0       10         10750        3.11
1       60          8814        2.55
2       20          8458        2.45
3       90          7007        2.03
4        0          6584        1.91
5      200          5879        1.70
6      150          5843        1.69
7      141          5793        1.68
8      400          5735        1.66
9      160          5668        1.64
10     310          5590        1.62
11      70          5487        1.59
12      50          5430        1.57
13     250          5385        1.56
14      40          5276        1.53
15     470          5230        1.51
16     480          5229        1.51
17     490          5010        1.45
18     100          4995        1.45
19     120          4936        1.43
20     240          4887        1.41
21     550          4513        1.31
22      30          4491        1.30
23     290          4417        1.28
24     390          4428        1.28
OBSERVATION
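The above plot shows the Top-25 County Codes with the highest percentage of non-fraudulent case submissions.
The main observation from the above 2 plots is that most County Codes have similar percentages for fraudulent and non-fraudulent claims, although a few differ noticeably (for example, County 200: 4.74% vs 1.70%, County 10: 1.52% vs 3.11%). Therefore, it feels like this feature might not be very useful overall, but those County Codes may deserve a closer look.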
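One quick check of how alike the two rankings are is to count the County Codes that appear in both Top-25 lists. The sketch below is an illustration rather than part of the original notebook; the codes are re-entered from the two printed tables and the variable names are made up for this example.

```python
# Top-25 County Codes by share of fraudulent cases (first printed table).
top_fraud_counties = [
    200, 470, 400, 590, 0, 160, 620, 130, 490, 170, 440, 20, 90,
    150, 290, 510, 310, 390, 331, 10, 141, 60, 700, 250, 530,
]

# Top-25 County Codes by share of non-fraudulent cases (second printed table).
top_non_fraud_counties = [
    10, 60, 20, 90, 0, 200, 150, 141, 400, 160, 310, 70, 50,
    250, 40, 470, 480, 490, 100, 120, 240, 550, 30, 290, 390,
]

# Codes present in both lists, and codes that only rank high for fraud.
common = sorted(set(top_fraud_counties) & set(top_non_fraud_counties))
fraud_only = sorted(set(top_fraud_counties) - set(top_non_fraud_counties))

print(f"In both Top-25 lists  : {len(common)} codes -> {common}")
print(f"Only in the fraud list: {len(fraud_only)} codes -> {fraud_only}")
```

Codes that rank high only in the fraudulent list are natural candidates for a closer look before the feature is dropped.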
Human Races have any relationship with maximum number of fraudulent cases?¶
tmp = pd.DataFrame(train_iobp_df.groupby(['Race','PotentialFraud'])['BeneID'].count()).reset_index()
tmp.columns = ['Race', 'Fraud?', 'Num_of_cases']
tot_fraud_cases = tmp[tmp['Fraud?'] == 'Yes']['Num_of_cases'].sum()
tot_non_fraud_cases = tmp[tmp['Fraud?'] == 'No']['Num_of_cases'].sum()
tmp['Cases'] = tmp['Fraud?'].apply(lambda val: tot_non_fraud_cases if val == "No" else tot_fraud_cases)
tmp['Percentage'] = round(((tmp['Num_of_cases'] / tmp['Cases']) * 100),2)
tmp.head()
| | Race | Fraud? | Num_of_cases | Cases | Percentage |
|---|---|---|---|---|---|
| 0 | 1 | No | 292691 | 345415 | 84.74 |
| 1 | 1 | Yes | 178345 | 212796 | 83.81 |
| 2 | 2 | No | 35356 | 345415 | 10.24 |
| 3 | 2 | Yes | 20284 | 212796 | 9.53 |
| 4 | 3 | No | 10753 | 345415 | 3.11 |
tmp_only_frauds = tmp[tmp['Fraud?'] == 'Yes'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_frauds[['Race','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_frauds, x="Race", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Fraudulent Race", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Human Race with most number of fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
   Race  Num_of_cases  Percentage
0     1        178345       83.81
1     2         20284        9.53
2     3          8962        4.21
3     5          5205        2.45
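OBSERVATION
The above plot shows the percentage of fraudulent case submissions for each Human Race code; Race 1 accounts for 83.81% of them.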
Human Races have any relationship with maximum number of non-fraudulent cases?¶
tmp_only_non_frauds = tmp[tmp['Fraud?'] == 'No'].sort_values(by=['Percentage'], ascending=False).reset_index(drop=True)
print(tmp_only_non_frauds[['Race','Num_of_cases','Percentage']].head(25), "\n")
with plt.style.context('seaborn'):
    plt.figure(figsize=(10,8))
    fig = sns.barplot(data=tmp_only_non_frauds, x="Race", y="Num_of_cases", palette='Accent')
    # Using the "patches" function we will get the location of the rectangle bars from the graph.
    ## Then by using those location(width & height) values we will add the annotations
    for p in fig.patches:
        width = p.get_width()
        height = p.get_height()
        x, y = p.get_xy()
        fig.annotate(f'{str(round((height*100)/tot_non_fraud_cases,2))+"%"}', (x + width/2, y + height*1.025), ha='center', fontsize=13.5, rotation=0)
    # Providing the labels and title to the graph
    plt.xlabel("\nTop Non-Fraudulent Race", fontdict=label_font_dict)
    plt.xticks(rotation=0, fontsize=12)
    plt.ylabel("Number (or % share) of Cases\n", fontdict=label_font_dict)
    plt.minorticks_on()
    plt.grid(which='major', linestyle="--", color='lightgrey')
    plt.title("Human Race with most number of non-fraudulent cases\n", fontdict=title_font_dict)
    plt.plot();
   Race  Num_of_cases  Percentage
0     1        292691       84.74
1     2         35356       10.24
2     3         10753        3.11
3     5          6615        1.92
OBSERVATION
The above plot shows the percentage of non-fraudulent case submissions for each Human Race code.
The main observation from the above 2 plots is that each Human Race code has a similar percentage of fraudulent and non-fraudulent claims (for example, Race 1: 83.81% vs 84.74%, Race 2: 9.53% vs 10.24%). Therefore, it feels like this feature might not be very useful.
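"Similar percentages" can be summarised with an effect-size measure such as Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below swaps in this standard statistic as an illustration; it is not something the original notebook computes. The counts are re-entered from the two printed Race tables, and only the standard library is used.

```python
import math

# Race code -> (fraudulent, non-fraudulent) case counts,
# copied from the two printed tables above.
race_counts = {
    1: (178345, 292691),
    2: (20284, 35356),
    3: (8962, 10753),
    5: (5205, 6615),
}

n_total = sum(fraud + non_fraud for fraud, non_fraud in race_counts.values())
col_totals = (
    sum(counts[0] for counts in race_counts.values()),  # all fraudulent cases
    sum(counts[1] for counts in race_counts.values()),  # all non-fraudulent cases
)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for observed in race_counts.values():
    row_total = sum(observed)
    for obs, col_total in zip(observed, col_totals):
        expected = row_total * col_total / n_total
        chi2 += (obs - expected) ** 2 / expected

# Cramér's V = sqrt(chi2 / (N * (min(rows, cols) - 1))).
k = min(len(race_counts), len(col_totals)) - 1
cramers_v = math.sqrt(chi2 / (n_total * k))

print(f"chi-square = {chi2:.1f}, Cramér's V = {cramers_v:.3f}")
```

With more than half a million rows, a chi-square test alone would flag even tiny differences as significant, so the effect size is the more informative number for judging whether the feature is useful.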